Better Together: Large Monolingual, Bilingual and Multimodal Corpora in NLP

If you have a question about this talk, please contact Thomas Lippincott.

In this talk, I contrast NLP systems trained on three types of corpora: (1) annotated (e.g. the Penn Treebank), (2) bitexts (e.g. Europarl), and (3) unannotated monolingual (e.g. Google N-grams). Size matters: (1) is about a million words, (2) is potentially billions of words, and (3) is potentially trillions of words. I focus on the problem of finding the syntactic structure of complex noun phrases, e.g. deciding whether "retired science teacher" brackets as [retired [science teacher]] or [[retired science] teacher]. The unannotated monolingual data is helpful when the ambiguity can be resolved through associations among the lexical items. The bilingual data is helpful when the ambiguity can be resolved by the order of the words in the translation. I show how to iteratively improve the performance of a noun phrase parser by co-training over the monolingual and bilingual feature views. The co-trained system achieves state-of-the-art results (both within and across domains) starting from only a handful of labeled examples. I also describe NLP systems that successfully exploit the huge volume of labeled images on the web. If a picture's worth a thousand words, then online visual data might comprise the biggest linguistic corpus of all.
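To make the co-training loop concrete, the Python sketch below shows a generic version of the idea: two classifiers, one per feature view, bootstrap each other from a small labeled seed by exchanging their most confident predictions on unlabeled examples. All names and interfaces here are illustrative assumptions, not the speaker's system: X_mono is taken to hold monolingual association features, X_bi bilingual word-order features, and each row to represent one binary bracketing decision in a noun phrase.

# A minimal co-training sketch (hypothetical setup, not the speaker's code).
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X_mono, X_bi, seed_idx, y_seed, unlabeled_idx,
             rounds=10, per_view=20):
    # Shared pool of (pseudo-)labeled examples, starting from the seed.
    # The seed must contain both classes for the classifiers to train.
    labels = dict(zip(seed_idx, y_seed))
    labeled = list(seed_idx)
    pool = list(unlabeled_idx)
    for _ in range(rounds):
        y = np.array([labels[i] for i in labeled])
        clf_mono = LogisticRegression(max_iter=1000).fit(X_mono[labeled], y)
        clf_bi = LogisticRegression(max_iter=1000).fit(X_bi[labeled], y)
        # Each view promotes the unlabeled examples it is most confident
        # about; the predicted labels join the shared training set, so each
        # classifier ends up learning from the other view's confident calls.
        for clf, X in ((clf_mono, X_mono), (clf_bi, X_bi)):
            if not pool:
                break
            proba = clf.predict_proba(X[pool])
            top = np.argsort(proba.max(axis=1))[-per_view:]
            for j in top:
                i = pool[j]
                labels[i] = clf.classes_[proba[j].argmax()]
                labeled.append(i)
            keep = set(range(len(pool))) - set(top)
            pool = [pool[k] for k in sorted(keep)]
    return clf_mono, clf_bi

The standard co-training assumption is that each view alone is sufficient to classify and that the views are conditionally independent given the label; lexical association statistics and translation word order are plausibly complementary in just this way, which is what lets a handful of seed examples bootstrap a full noun phrase parser.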

This talk is part of the NLIP Seminar Series.
