Making the World's Scientific Information (More) Organized, Accessible, and Usable

If you have a question about this talk, please contact Laura Rimell.

Web portals like Google Scholar and ScienceDirect have revolutionized access to scientific information by making it possible to identify relevant papers via keyword search and then browse them online. However, as scientific information continues to grow exponentially, and as (e-)science embraces automation, it is becoming impossible to keep abreast of these papers and to exploit the information in them effectively.

I’ll describe a prototype scientific literature search and information extraction system, developed in collaboration with the FlyBase (Fruit Fly Genomics) curation team, designed to support very fine-grained but intuitive querying and access to information in a collection of papers. FlySearch indexes annotated papers and supports integrated search over individual sentences and images, aggregating information across the collection. For example, one can search captions describing a specific gene regulating a biological process and restrict the associated images to a specific body part.
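The kind of query described above can be illustrated with a minimal sketch. It assumes each caption has already been annotated with the genes, biological processes, and body part it mentions. The record type `Caption`, the function `search_captions`, and the sample data are hypothetical and are not FlySearch's actual data model or interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caption:
    """Hypothetical annotated caption record (not FlySearch's real schema)."""
    paper_id: str
    text: str
    genes: frozenset
    processes: frozenset
    body_part: str


def search_captions(captions, gene, process, body_part):
    """Return captions that mention the gene and the process and whose
    associated image shows the given body part."""
    return [
        c for c in captions
        if gene in c.genes and process in c.processes and c.body_part == body_part
    ]


# Invented sample data, for illustration only.
CAPTIONS = [
    Caption("p1", "dpp regulates growth in the wing disc",
            frozenset({"dpp"}), frozenset({"growth"}), "wing"),
    Caption("p2", "dpp regulates growth in the eye disc",
            frozenset({"dpp"}), frozenset({"growth"}), "eye"),
    Caption("p3", "wg expression along the wing margin",
            frozenset({"wg"}), frozenset({"patterning"}), "wing"),
]

hits = search_captions(CAPTIONS, gene="dpp", process="growth", body_part="wing")
```

Because the filter runs over captions from every paper in the collection, the result aggregates matches across papers rather than ranking whole documents.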

The system rests on a processing pipeline. A paper in Portable Document Format is first converted to Scientific eXtensible Mark-up Language, which preserves its logical structure while separating, for example, images, tables, and references from running text. Specialized text and image processing tools are then applied to the different components of the paper. These tools compute image similarity and recognize gene names, facts about genes, and their relationships to other biological entities, among other things. They have been designed to be as generic as possible to facilitate application to other areas of science. Where they require domain-specific tuning, they have been developed using semi-supervised machine learning methods to minimize the cost of that tuning.
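The shape of such a pipeline can be sketched as follows, under strong simplifying assumptions. The paper is assumed to be already converted into a dictionary of components, gene recognition is reduced to a lookup in a tiny hypothetical lexicon (`GENE_LEXICON`), and image similarity is reduced to cosine similarity over precomputed feature vectors. None of these names or methods are taken from the system described in the talk.

```python
import math
import re

# Hypothetical lexicon; a real recognizer would be a trained model.
GENE_LEXICON = {"wg", "dpp", "hh"}


def split_components(paper):
    """Stand-in for the conversion step: keep running text, captions,
    and references as separate components."""
    return {
        "text": paper.get("text", []),
        "captions": paper.get("captions", []),
        "references": paper.get("references", []),
    }


def recognize_genes(sentence):
    """Return the sorted set of lexicon gene names found in a sentence."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return sorted({t for t in tokens if t in GENE_LEXICON})


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def process_paper(paper):
    """Apply the text tool to each textual component of the paper."""
    parts = split_components(paper)
    return {
        "text": [{"sentence": s, "genes": recognize_genes(s)} for s in parts["text"]],
        "captions": [{"caption": c, "genes": recognize_genes(c)} for c in parts["captions"]],
    }


annotated = process_paper({
    "text": ["dpp regulates wg expression."],
    "captions": ["Expression of hh in the wing disc."],
})
```

Keeping the component split separate from the per-component tools mirrors the design goal stated above: the generic pipeline stays fixed, and only the domain-specific pieces (here, the lexicon) would change for another area of science.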

Initial results suggest that many aspects of the user interface need refinement, but that the underlying search functionality significantly improves speed and precision over keyword-based, document-level search. Many challenges remain. Perhaps the most pressing is handling more of the contextually mediated variation in how the same meaning can be expressed. We would also like to go beyond finding and extracting relations between biological entities and, for example, support reasoning (e.g. temporal reasoning) about biological events.

This talk is part of the University of Cambridge NLIP Seminar Series.
