Learning Bigrams from Unigrams
If you have a question about this talk, please contact Zoubin Ghahramani.
Traditional wisdom holds that once documents are turned into bag-of-words (unigram count) vectors, word order is completely lost. We introduce an approach that, perhaps surprisingly, is able to learn a bigram language model from a set of bag-of-words documents. At its heart, our approach is an EM algorithm that seeks a model that maximizes the regularized marginal likelihood of the bag-of-words documents. In experiments on seven corpora, we observed that our learned bigram language models:
- achieve better test set perplexity than unigram models trained on the same bag-of-words documents, and are not far behind “oracle bigram models” trained on the corresponding ordered documents;
- assign higher probabilities to sensible bigram word pairs; and
- improve the accuracy of ordered document recovery from a bag-of-words.
Our approach opens the door to novel phenomena, for example, privacy leakage from index files.
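For readers curious how such an EM procedure might look, the toy sketch below (not the authors' implementation) treats the word order of each document as a latent variable: the E-step enumerates the possible orderings of each small bag and weights their bigram counts by the posterior probability of each ordering under the current model, and the M-step re-estimates a smoothed bigram table from those expected counts. The function and parameter names (em_bigrams, alpha, etc.) are illustrative assumptions; the paper approximates the E-step to stay tractable on real corpora, and its regularization is richer than the simple pseudo-count smoothing used here.

# Toy sketch: EM for a bigram language model when only bag-of-words
# documents are observed and word order is latent. Exact enumeration of
# orderings is only feasible for tiny documents; the ACL 2008 paper uses
# approximations and a different regularizer. Names here are illustrative.
from collections import Counter
from itertools import permutations
import random

def seq_prob(seq, start, bigram):
    """Probability of one ordered word sequence under the current model."""
    p = start[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= bigram[prev][cur]
    return p

def em_bigrams(bags, vocab, iters=30, alpha=0.1, seed=0):
    """Estimate start and bigram probabilities from unordered word bags."""
    rng = random.Random(seed)

    def rand_dist():
        # Random initialization breaks the symmetry between an ordering
        # and its reverse, which a uniform start would preserve.
        w = [rng.random() + 1e-3 for _ in vocab]
        s = sum(w)
        return dict(zip(vocab, (x / s for x in w)))

    start = rand_dist()
    bigram = {u: rand_dist() for u in vocab}

    for _ in range(iters):
        exp_start = Counter()                    # expected sentence-initial counts
        exp_big = {u: Counter() for u in vocab}  # expected bigram counts

        # E-step: spread each bag's counts over its possible orderings,
        # weighted by the posterior probability of each ordering.
        for bag in bags:
            orderings = set(permutations(bag))
            weights = {o: seq_prob(o, start, bigram) for o in orderings}
            z = sum(weights.values())
            for o, w in weights.items():
                post = w / z
                exp_start[o[0]] += post
                for prev, cur in zip(o, o[1:]):
                    exp_big[prev][cur] += post

        # M-step: re-estimate smoothed probabilities from expected counts
        # (alpha is a simple pseudo-count standing in for regularization).
        total = sum(exp_start.values()) + alpha * len(vocab)
        start = {w: (exp_start[w] + alpha) / total for w in vocab}
        for u in vocab:
            row_total = sum(exp_big[u].values()) + alpha * len(vocab)
            bigram[u] = {w: (exp_big[u][w] + alpha) / row_total for w in vocab}

    return start, bigram

if __name__ == "__main__":
    vocab = ["the", "dog", "barks"]
    bags = [["the", "dog", "barks"], ["the", "dog"], ["dog", "barks"]] * 5
    start, bigram = em_bigrams(bags, vocab)
    print({w: round(p, 3) for w, p in bigram["the"].items()})

Because bags of words alone leave some ordering ambiguity, the probabilities this toy learns depend on the initialization and smoothing; hence the emphasis on regularization in the abstract above.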
This work was originally presented at ACL 2008, and is joint work with Xiaojin “Jerry” Zhu (UW), Michael Rabbat (McGill), and Rob Nowak (UW).
Bio: Andrew Goldberg is a fourth-year PhD student at UW-Madison. He is broadly interested in statistical machine learning and natural language processing. He specializes in semi-supervised learning and is also part of a “text-to-picture synthesis” project that combines machine learning, NLP, and computer vision to aid communication.
This talk is part of the Machine Learning @ CUED series.