Can machines understand the scientific literature?

If you have a question about this talk, please contact Samantha Noel.

Perhaps half of the 3 million articles, preprints, theses, and items of gray literature published annually (up 5000/day) are directly relevant to biomedicine (including chemistry, materials, IT, engineering, etc.), and many of the rest (psychology, politics, law, philosophy) are needed to tackle global challenges. We need to process this literature for scientific computation (indexing, searching, data abstraction, and ultimately “Artificial Intelligence”). But the raw material (usually PDF) is very poorly suited to automatic ingestion, and the major search engines are not well suited to science. We will present prototypes of Open tools (software, dictionaries) that extract science in computable (semantic) form. Since science is a global endeavor, the tools must be equitable and inclusive, and we have included collaborators using several languages (EN, HI, TA, UR, ES, IND).

The central ontology is based on multilingual Wikidata (ca. 100 million Items), which is increasingly subsuming the major biomedical and chemical ontologies and some reference data. The scholarly literature is also formally indexed there (Scholia). Where possible, all our entities and many of their relationships are based on Wikidata Items (Q) and Properties (P). Our primary approach is supervised text-mining through faceted dictionaries created from Wikidata SPARQL queries. Current dictionaries include countries, diseases, drugs, chemicals, species, and organizations, and can be extended to many other areas (e.g. through Wikipedia categories). Besides text, many documents contain tables and diagrams, and it is also possible to extract data from these, such as phylogenetic trees, forest plots, and graphs.
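As a rough illustration of the dictionary approach described above (not the speakers' actual tools), the sketch below builds a SPARQL query string for one facet and then annotates a sentence with a small faceted dictionary. The function names `build_dictionary_query` and `annotate`, the toy dictionary, and the example sentence are all assumptions made for this illustration. The QIDs were typed in by hand and should be checked against Wikidata before use. The query is only constructed, not sent to any endpoint.

```python
import re


def build_dictionary_query(class_qid, lang="en", limit=100):
    """Return a SPARQL query for items that are instances of class_qid
    (or of one of its subclasses), with labels in the given language.
    P31 is "instance of" and P279 is "subclass of" in Wikidata."""
    return f"""
SELECT ?item ?label WHERE {{
  ?item wdt:P31/wdt:P279* wd:{class_qid} ;
        rdfs:label ?label .
  FILTER(LANG(?label) = "{lang}")
}}
LIMIT {limit}
""".strip()


def annotate(text, dictionaries):
    """Find dictionary terms in text (case-insensitive, whole words).
    Returns (start, matched_text, facet, qid) tuples sorted by position."""
    hits = []
    for facet, terms in dictionaries.items():
        for term, qid in terms.items():
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((match.start(), match.group(0), facet, qid))
    return sorted(hits)


# Toy faceted dictionary, hand-written for illustration only.
# In practice each facet would be filled from a query result.
TOY_DICTIONARIES = {
    "disease": {"malaria": "Q12156", "tuberculosis": "Q12204"},
    "country": {"India": "Q668", "Brazil": "Q155"},
}

# Q12136 is assumed here to be the Wikidata Item for "disease".
query = build_dictionary_query("Q12136", lang="en", limit=50)

sentence = "Malaria and tuberculosis remain major causes of death in India."
hits = annotate(sentence, TOY_DICTIONARIES)
for start, matched, facet, qid in hits:
    print(start, matched, facet, qid)
```

Keeping one dictionary per facet means each match carries both its category and a Wikidata identifier, so the same lookup could work with labels in other languages by changing the `lang` argument of the query.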

We shall give examples of several tools that can be run from Jupyter Notebooks and are designed to be generic and extensible.

This talk is part of the Computational and Systems Biology series.
