Modeling Science: Topic models of Scientific Journals and Other Large Document Collections
If you have a question about this talk, please contact Zoubin Ghahramani.

A surge of recent research in machine learning and statistics has developed new techniques for finding patterns of words in document collections using hierarchical probabilistic models. These models are called "topic models" because the word patterns often reflect the underlying topics that permeate the documents; however, topic models also apply naturally to data such as images and biological sequences. After reviewing the basics of topic modeling, I will describe two related lines of research in this field, which extend the current state of the art.

First, while previous topic models have assumed that the corpus is static, many document collections actually change over time: scientific articles, emails, and search queries reflect evolving content, and it is important to model the corresponding evolution of the underlying topics. For example, an article about biology in 1885 will exhibit significantly different word frequencies than one in 2005. I will describe probabilistic models designed to capture the dynamics of topics as they evolve over time.

Second, previous models have assumed that the occurrences of the different latent topics are independent. In many document collections, the presence of one topic may be correlated with the presence of another. For example, a document about sports is more likely to also be about health than about international finance. I will describe a probabilistic topic model which can capture such correlations between the hidden topics.

In addition to giving quantitative, predictive models of a corpus, topic models provide a qualitative window into the structure of a large document collection. This perspective allows a user to explore a corpus in a topic-guided fashion. We demonstrate the capabilities of these new models on the archives of the journal Science, founded in 1880 by Thomas Edison. Our models are built on the noisy text from JSTOR, an online scholarly journal archive, produced by running an optical character recognition engine over the original bound journals.

(Joint work with J. Lafferty.)
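The basic kind of topic model reviewed at the start of the talk can be tried with standard software. The sketch below is not part of the talk itself: it fits a plain latent Dirichlet allocation model to a tiny made-up corpus using scikit-learn, and the documents, number of topics, and parameter values are assumptions for illustration only.

# Minimal LDA sketch (scikit-learn >= 1.0); corpus and settings are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "gene expression in cell biology",
    "stock markets and international finance",
    "training exercise and heart health",
]

vectorizer = CountVectorizer().fit(docs)
X = vectorizer.transform(docs)                     # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each row of components_ is an (unnormalised) topic-word distribution.
vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:5]
    print(f"topic {k}:", [vocab[i] for i in top])

On a corpus this small the fitted topics are meaningless; the point is only the workflow of counting words and inspecting the learned topic-word weights.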
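The dynamic-topic idea in the abstract can be illustrated with a toy generative sketch: each topic's natural parameters take a small Gaussian step from one time slice to the next, so its word distribution drifts smoothly (as biology vocabulary did between 1885 and 2005). This is only a schematic of the idea, not the speakers' implementation; the vocabulary size, number of topics, number of time slices, and drift variance are arbitrary assumptions.

# Toy sketch of topics evolving over time via a Gaussian random walk on natural parameters.
import numpy as np

rng = np.random.default_rng(0)
V, K, T, sigma2 = 1000, 5, 10, 0.01    # vocab size, topics, time slices, drift variance

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

beta = np.zeros((T, K, V))             # natural parameters beta[t, k] for topic k at time t
beta[0] = rng.normal(0, 1, size=(K, V))
for t in range(1, T):
    # each topic at time t is a small Gaussian perturbation of itself at time t-1
    beta[t] = beta[t - 1] + rng.normal(0, np.sqrt(sigma2), size=(K, V))

# The word distribution of topic k at time t is the softmax of its natural parameters.
topic_word = np.array([[softmax(beta[t, k]) for k in range(K)] for t in range(T)])
print(topic_word.shape)                # (T, K, V): one drifting distribution per topic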
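The correlated-topic idea can likewise be sketched by drawing per-document topic proportions from a logistic-normal distribution instead of a Dirichlet, so that a covariance matrix can encode, say, that sports and health tend to co-occur while both are less likely alongside finance. Again this is a schematic with made-up numbers, not the model fitted in the talk.

# Toy sketch of correlated topic proportions via a logistic-normal draw.
import numpy as np

rng = np.random.default_rng(0)
K = 3                                   # topics: say sports, health, finance
mu = np.zeros(K)
Sigma = np.array([[1.0, 0.8, -0.3],     # sports and health positively correlated,
                  [0.8, 1.0, -0.2],     # both mildly anti-correlated with finance
                  [-0.3, -0.2, 1.0]])

eta = rng.multivariate_normal(mu, Sigma)      # document-level Gaussian draw
theta = np.exp(eta) / np.exp(eta).sum()       # logistic-normal topic proportions
print(theta)                                  # correlated proportions, unlike a Dirichlet draw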
This talk is part of the Machine Learning @ CUED series. This talk is included in these lists (note that ex-directory lists are not shown).