
Learning Bigrams from Unigrams


If you have a question about this talk, please contact Zoubin Ghahramani.

Traditional wisdom holds that once documents are turned into bag-of-words (unigram count) vectors, word order is completely lost. We introduce an approach that, perhaps surprisingly, is able to learn a bigram language model from a set of bag-of-words documents. At its heart, our approach is an EM algorithm that seeks a model that maximizes the regularized marginal likelihood of the bag-of-words documents. In experiments on seven corpora, we observed that our learned bigram language models:
  1. achieve better test set perplexity than unigram models trained on the same bag-of-words documents, and are not far behind “oracle bigram models” trained on the corresponding ordered documents;
  2. assign higher probabilities to sensible bigram word pairs; and
  3. improve the accuracy of ordered document recovery from a bag-of-words.

Our approach opens the door to novel phenomena, for example, privacy leakage from index files.
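
To make "marginal likelihood of a bag-of-words" and "ordered document recovery" concrete, here is a minimal brute-force sketch in Python. It is not the EM algorithm from the talk: it assumes a bigram model is already given and simply enumerates every distinct ordering of a tiny bag. The function names, the `<s>` start symbol, and the toy probabilities are all hypothetical, chosen only for illustration.

```python
from itertools import permutations

START = "<s>"  # hypothetical start-of-document symbol


def sequence_prob(words, bigram):
    """Probability of one ordered word sequence under a bigram model.

    `bigram` maps (previous_word, word) to P(word | previous_word).
    Pairs missing from the table are treated as probability 0.
    """
    prob = 1.0
    prev = START
    for word in words:
        prob *= bigram.get((prev, word), 0.0)
        prev = word
    return prob


def distinct_orderings(bag):
    """All distinct orderings of a bag given as {word: count}."""
    tokens = [w for w, c in sorted(bag.items()) for _ in range(c)]
    return sorted(set(permutations(tokens)))


def bag_marginal_likelihood(bag, bigram):
    """Sum the sequence probability over every ordering consistent with the bag."""
    return sum(sequence_prob(o, bigram) for o in distinct_orderings(bag))


def recover_order(bag, bigram):
    """Return the single most probable ordering of the bag."""
    return max(distinct_orderings(bag), key=lambda o: sequence_prob(o, bigram))


# Toy model over three words; the numbers are made up for illustration.
toy_bigram = {
    ("<s>", "the"): 0.8, ("<s>", "dog"): 0.1, ("<s>", "barks"): 0.1,
    ("the", "dog"): 0.9, ("the", "barks"): 0.1,
    ("dog", "barks"): 0.7, ("dog", "the"): 0.3,
    ("barks", "the"): 0.5, ("barks", "dog"): 0.5,
}
toy_bag = {"the": 1, "dog": 1, "barks": 1}

if __name__ == "__main__":
    print(bag_marginal_likelihood(toy_bag, toy_bigram))
    print(recover_order(toy_bag, toy_bigram))
```

Enumerating orderings grows factorially with document length, so this only works for very short bags; it is meant to show what quantity is being summed, not how one would compute it at scale.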

This work was originally presented at ACL 2008 and is a collaboration with Xiaojin “Jerry” Zhu (UW), Michael Rabbat (McGill), and Rob Nowak (UW).

Bio: Andrew Goldberg is a fourth-year PhD student at UW-Madison. He is broadly interested in statistical machine learning and natural language processing. He specializes in semi-supervised learning and is also part of a “text-to-picture synthesis” project that combines machine learning, NLP, and computer vision for aiding communication.

This talk is part of the Machine Learning @ CUED series.

