BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Better Together: Large Monolingual\, Bilingual and Multimodal Corp
 ora in NLP - Shane Bergsma - Johns Hopkins University
DTSTART:20111010T110000Z
DTEND:20111010T120000Z
UID:TALK33176@talks.cam.ac.uk
CONTACT:Thomas Lippincott
DESCRIPTION:In this talk\, I contrast NLP systems trained using three type
 s of corpora: (1) annotated (e.g. the Penn Treebank)\, (2) bitexts (e.g. E
 uroparl)\, and (3) unannotated monolingual (e.g. Google N-grams). Size mat
 ters: (1) is a million words\, (2) is potentially billions of words and (3
 ) is potentially trillions of words. I focus on the problem of finding the
  syntactic structure of complex noun phrases. The unannotated monolingual 
 data is helpful when ambiguity can be resolved through associations among 
 the lexical items. The bilingual data is helpful when ambiguity can be res
 olved by the order of words in the translation. I show how to iteratively 
 improve the performance of a noun phrase parser by co-training over the mo
 nolingual and bilingual feature views. The co-trained system achieves stat
 e-of-the-art results (both within and across domains) starting from only a
  handful of labeled examples. I also describe NLP systems that successfull
 y exploit the huge volume of labeled images on the web. If a picture's wor
 th a thousand words\, then online visual data might comprise the biggest l
 inguistic corpus of all.
LOCATION:SW01\, Computer Laboratory
END:VEVENT
END:VCALENDAR
