Learning Bigrams from Unigrams
If you have a question about this talk, please contact Zoubin Ghahramani.
Traditional wisdom holds that once documents are turned into bag-of-words (unigram count) vectors, word order is completely lost. We introduce an approach that, perhaps surprisingly, is able to learn a bigram language model from a set of bag-of-words documents. At its heart, our approach is an EM algorithm that seeks a model that maximizes the regularized marginal likelihood of the bag-of-words documents. In experiments on seven corpora, we observed that our learned bigram language models:
- achieve better test set perplexity than unigram models trained on the same bag-of-words documents, and are not far behind “oracle bigram models” trained on the corresponding ordered documents;
- assign higher probabilities to sensible bigram word pairs; and
- improve the accuracy of ordered document recovery from a bag-of-words.
Our approach opens the door to novel phenomena, for example, privacy leakage from index files.
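For readers curious how such an EM procedure might look, the toy sketch below (not the authors' implementation) treats the word order of each document as a latent variable: the E-step enumerates the possible orderings of each small bag and weights their bigram counts by the posterior probability of each ordering under the current model, and the M-step re-estimates a smoothed bigram table from those expected counts. The function and parameter names (em_bigrams, alpha, etc.) are illustrative assumptions; the paper approximates the E-step to stay tractable on real corpora, and its regularization is richer than the simple pseudo-count smoothing used here.

# Toy sketch: EM for a bigram language model when only bag-of-words
# documents are observed and word order is latent. Exact enumeration of
# orderings is only feasible for tiny documents; the ACL 2008 paper uses
# approximations and a different regularizer. Names here are illustrative.
from collections import Counter
from itertools import permutations
import random

def seq_prob(seq, start, bigram):
    """Probability of one ordered word sequence under the current model."""
    p = start[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= bigram[prev][cur]
    return p

def em_bigrams(bags, vocab, iters=30, alpha=0.1, seed=0):
    """Estimate start and bigram probabilities from unordered word bags."""
    rng = random.Random(seed)

    def rand_dist():
        # Random initialization breaks the symmetry between an ordering
        # and its reverse, which a uniform start would preserve.
        w = [rng.random() + 1e-3 for _ in vocab]
        s = sum(w)
        return dict(zip(vocab, (x / s for x in w)))

    start = rand_dist()
    bigram = {u: rand_dist() for u in vocab}

    for _ in range(iters):
        exp_start = Counter()                    # expected sentence-initial counts
        exp_big = {u: Counter() for u in vocab}  # expected bigram counts

        # E-step: spread each bag's counts over its possible orderings,
        # weighted by the posterior probability of each ordering.
        for bag in bags:
            orderings = set(permutations(bag))
            weights = {o: seq_prob(o, start, bigram) for o in orderings}
            z = sum(weights.values())
            for o, w in weights.items():
                post = w / z
                exp_start[o[0]] += post
                for prev, cur in zip(o, o[1:]):
                    exp_big[prev][cur] += post

        # M-step: re-estimate smoothed probabilities from expected counts
        # (alpha is a simple pseudo-count standing in for regularization).
        total = sum(exp_start.values()) + alpha * len(vocab)
        start = {w: (exp_start[w] + alpha) / total for w in vocab}
        for u in vocab:
            row_total = sum(exp_big[u].values()) + alpha * len(vocab)
            bigram[u] = {w: (exp_big[u][w] + alpha) / row_total for w in vocab}

    return start, bigram

if __name__ == "__main__":
    vocab = ["the", "dog", "barks"]
    bags = [["the", "dog", "barks"], ["the", "dog"], ["dog", "barks"]] * 5
    start, bigram = em_bigrams(bags, vocab)
    print({w: round(p, 3) for w, p in bigram["the"].items()})

Because bags of words alone leave some ordering ambiguity, the probabilities this toy learns depend on the initialization and smoothing; hence the emphasis on regularization in the abstract above.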
This work was originally presented at ACL 2008, and is joint work with Xiaojin “Jerry” Zhu (UW), Michael Rabbat (McGill), and Rob Nowak (UW).
Bio: Andrew Goldberg is a fourth-year PhD student at UW-Madison. He is broadly interested in statistical machine learning and natural language processing. He specializes in semi-supervised learning and is also part of a “text-to-picture synthesis” project that combines machine learning, NLP, and computer vision to aid communication.
This talk is part of the Machine Learning @ CUED series.