A Probabilistic Framework for Modeling Cross-Lingual Semantic Similarity (out of and in Context) Based on Latent Cross-Lingual Concepts

If you have a question about this talk, please contact Tamara Polajnar.

As the World Wide Web continues to grow and becomes omnipresent in today’s increasingly connected world, users are abandoning English as the lingua franca of the global network, since more and more content is becoming available in their native languages. In addition, with the rapid development of online encyclopedias such as Wikipedia, the blogosphere, and online news portals, users have generated a huge volume of multilingual text resources. There is a pressing need for tools that can induce knowledge from these user-generated multilingual text resources and accomplish cross-lingual text processing automatically or with minimal human intervention.

In this talk we address cross-lingual semantic similarity, the task of detecting which words (or, more generally, text units) express similar semantic concepts and convey similar meanings across languages. Models of cross-lingual similarity are typically used to automatically induce bilingual lexicons and have found numerous applications in information retrieval (IR), statistical machine translation (SMT) and other natural language processing (NLP) tasks.

Research into corpus-based cross-lingual models of distributional similarity has focused on building context-insensitive models of cross-lingual similarity that typically rely on external resources, such as readily available bilingual lexicons or parallel data, to bridge the lexical chasm between two languages. In this talk we follow a new research path and present a probabilistic approach to modeling cross-lingual semantic similarity (out of and in context) that is fully data-driven: it relies on no resources other than a (non-parallel) multilingual corpus. The framework rests on the idea of projecting words and sets of words into a shared latent semantic space spanned by language-pair-independent latent cross-lingual semantic concepts (e.g., cross-lingual topics obtained by a multilingual topic model). These latent concepts are induced from a comparable corpus without any additional lexical resources. Word meaning is represented as a probability distribution over the latent cross-lingual concepts, and a change in meaning is represented as a change in the distribution over these latent concepts.

The first part of this talk provides a crash course on multilingual text mining models, with an emphasis on the multilingual topic modeling approach. These models are used to induce the latent cross-lingual concepts from multilingual data. In the second part of the talk, we present a systematic overview of the context-insensitive models of cross-lingual similarity that are built upon the paradigm of latent cross-lingual concepts. We compare these models on the task of bilingual lexicon extraction (BLE). The final part of this talk presents an extension of the probabilistic framework towards context-aware models of cross-lingual similarity. We describe new models of similarity that modulate the isolated out-of-context word representations with contextual knowledge, and we report our findings on the task of word translation in context.
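The out-of-context representation described above, with word meaning as a probability distribution over shared latent cross-lingual concepts, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the talk: the toy per-concept word distributions (as a multilingual topic model might output), the uniform concept prior, the English and Dutch example words, the function names, and the choice of the Bhattacharyya coefficient as the similarity measure.

```python
import numpy as np


def concept_distributions(p_word_given_concept, p_concept):
    """Turn P(word | concept) into P(concept | word) with Bayes' rule.

    p_word_given_concept has shape (K, V): one row per latent concept,
    one column per vocabulary word. The result has shape (V, K), so each
    row is one word's distribution over the K shared concepts.
    """
    joint = p_word_given_concept * p_concept[:, None]   # P(word, concept)
    posterior = joint / joint.sum(axis=0, keepdims=True)
    return posterior.T


def bhattacharyya(p, q):
    """Similarity of two discrete distributions (1.0 means identical)."""
    return float(np.sum(np.sqrt(p * q)))


def rank_translations(source_vec, target_vecs, target_words):
    """Rank target-language words by similarity to one source word."""
    scores = [bhattacharyya(source_vec, t) for t in target_vecs]
    order = np.argsort(scores)[::-1]
    return [(target_words[i], scores[i]) for i in order]


# Hypothetical toy model: K = 2 shared concepts, 2 words per language.
prior = np.array([0.5, 0.5])
en_words = ["river", "money"]
nl_words = ["rivier", "geld"]
p_en = np.array([[0.9, 0.1],    # concept 0 over the English words
                 [0.2, 0.8]])   # concept 1 over the English words
p_nl = np.array([[0.8, 0.2],    # concept 0 over the Dutch words
                 [0.1, 0.9]])   # concept 1 over the Dutch words

en_vecs = concept_distributions(p_en, prior)
nl_vecs = concept_distributions(p_nl, prior)

for word, vec in zip(en_words, en_vecs):
    best, score = rank_translations(vec, nl_vecs, nl_words)[0]
    print(f"{word} -> {best} ({score:.3f})")
```

Because both languages are projected onto the same concept axes, the two vocabularies can be compared directly without a bilingual lexicon; taking the top-ranked target word for each source word gives a simple form of bilingual lexicon extraction. Any other measure over probability distributions could replace the Bhattacharyya coefficient in this sketch.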

This talk is part of the NLIP Seminar Series.

© 2006-2017, University of Cambridge.