
Mining scientific diagrams for semantic information


If you have a question about this talk, please contact Emily Boyd.

Scientific data are often reported only as diagrams in publications, so the underlying values are effectively destroyed and lost. These data are often critically valuable to other scientists and to data-abstracting services, and frequently have to be recreated manually from the diagram at great expense, with waste and error. Examples include plots, charts, and more complex objects such as chemical structure diagrams and phylogenetic (evolutionary) trees.

I shall show how, in favourable circumstances, it is possible to recreate semantic information from diagrams using well-established computer vision techniques. These include thresholding, binarization, dilation and thinning, OCR, and a variety of domain-specific heuristics. Our Open Source library is based on BoofCV, an open-source Java image-processing library, and is enhanced with tools useful for scientific documents. Some PDF documents contain vector images and are particularly tractable, while others contain only pixel images and suffer from overlap, problems of scale and loss of detail.
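
As a rough illustration of three of the techniques named above (thresholding to a binary image, dilation, and thinning), here is a minimal sketch in plain Java. It does not use BoofCV or the speaker's library; the class name `DiagramBinaryOps`, its method names, the fixed threshold of 128 and the choice of Zhang-Suen thinning are all assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; not part of BoofCV or the talk's library.
class DiagramBinaryOps {

    /** Binarize: pixels darker than thresh become foreground (1), e.g. black ink on white paper. */
    static int[][] threshold(int[][] gray, int thresh) {
        int h = gray.length, w = gray[0].length;
        int[][] out = new int[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                out[y][x] = gray[y][x] < thresh ? 1 : 0;
        return out;
    }

    /** Read a pixel, treating everything outside the image as background. */
    static int get(int[][] img, int y, int x) {
        return (y < 0 || x < 0 || y >= img.length || x >= img[0].length) ? 0 : img[y][x];
    }

    /** Dilate with a 3x3 square: a pixel becomes 1 if it or any of its 8 neighbours is 1. */
    static int[][] dilate8(int[][] bin) {
        int h = bin.length, w = bin[0].length;
        int[][] out = new int[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        if (get(bin, y + dy, x + dx) == 1) out[y][x] = 1;
        return out;
    }

    /** Zhang-Suen thinning: peel boundary pixels in two alternating passes until nothing changes. */
    static int[][] thin(int[][] bin) {
        int h = bin.length, w = bin[0].length;
        int[][] img = new int[h][];
        for (int y = 0; y < h; y++) img[y] = bin[y].clone();
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int pass = 0; pass < 2; pass++) {
                List<int[]> toClear = new ArrayList<>();
                for (int y = 0; y < h; y++) {
                    for (int x = 0; x < w; x++) {
                        if (img[y][x] == 0) continue;
                        // Neighbours clockwise from north: N, NE, E, SE, S, SW, W, NW.
                        int[] p = {
                            get(img, y - 1, x), get(img, y - 1, x + 1), get(img, y, x + 1),
                            get(img, y + 1, x + 1), get(img, y + 1, x), get(img, y + 1, x - 1),
                            get(img, y, x - 1), get(img, y - 1, x - 1)
                        };
                        int count = 0, transitions = 0;
                        for (int i = 0; i < 8; i++) {
                            count += p[i];
                            if (p[i] == 0 && p[(i + 1) % 8] == 1) transitions++;
                        }
                        boolean side = pass == 0
                                ? p[0] * p[2] * p[4] == 0 && p[2] * p[4] * p[6] == 0
                                : p[0] * p[2] * p[6] == 0 && p[0] * p[4] * p[6] == 0;
                        if (count >= 2 && count <= 6 && transitions == 1 && side)
                            toClear.add(new int[] {y, x});
                    }
                }
                // Delete simultaneously, after the whole pass has been evaluated.
                for (int[] c : toClear) img[c[0]][c[1]] = 0;
                changed |= !toClear.isEmpty();
            }
        }
        return img;
    }

    public static void main(String[] args) {
        // A dark 3x7 bar (value 0) on a white 5x9 background (value 255).
        int[][] gray = new int[5][9];
        for (int y = 0; y < 5; y++)
            for (int x = 0; x < 9; x++)
                gray[y][x] = (y >= 1 && y <= 3 && x >= 1 && x <= 7) ? 0 : 255;
        for (int[] row : thin(threshold(gray, 128))) {
            StringBuilder sb = new StringBuilder();
            for (int v : row) sb.append(v == 1 ? '#' : '.');
            System.out.println(sb);
        }
    }
}
```

In a pipeline of this kind, thinning reduces a thick stroke (a bond in a chemical structure, a branch in a tree) to a one-pixel-wide skeleton that is easier to trace into line segments, while dilation can be applied first to close small gaps in broken strokes. A real implementation would more likely choose the threshold from the image histogram rather than fix it in advance.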

I shall show the application to chemistry and phylogenetics and show where errors and loss occur.

See also my slides from last year at:

This talk is part of the Computational and Systems Biology series.


