Better Together: Large Monolingual, Bilingual and Multimodal Corpora in NLP
If you have a question about this talk, please contact Thomas Lippincott.

In this talk, I contrast NLP systems trained using three types of corpora: (1) annotated (e.g. the Penn Treebank), (2) bitexts (e.g. Europarl), and (3) unannotated monolingual (e.g. Google N-grams). Size matters: (1) is a million words, (2) is potentially billions of words, and (3) is potentially trillions of words. I focus on the problem of finding the syntactic structure of complex noun phrases. The unannotated monolingual data is helpful when ambiguity can be resolved through associations among the lexical items. The bilingual data is helpful when ambiguity can be resolved by the order of words in the translation. I show how to iteratively improve the performance of a noun phrase parser by co-training over the monolingual and bilingual feature views. The co-trained system achieves state-of-the-art results (both within and across domains) starting from only a handful of labeled examples. I also describe NLP systems that successfully exploit the huge volume of labeled images on the web. If a picture’s worth a thousand words, then online visual data might comprise the biggest linguistic corpus of all.

This talk is part of the NLIP Seminar Series.