Multiword Expressions: Evaluation of Extraction Methods and their Impact on Grammar Engineering
If you have a question about this talk, please contact Johanna Geiss.

In the first part of the talk I focus on the linguistic properties of Multiword Expressions (MWEs), taking a closer look at their lexical, syntactic, and semantic characteristics. The term Multiword Expressions has been used to describe expressions for which the syntactic or semantic properties of the whole expression cannot be derived from its parts (cf. Sag et al., 2002). It covers a large number of related but distinct phenomena, such as phrasal verbs (e.g., “come along”), nominal compounds (e.g., “frying pan”), institutionalised phrases (e.g., “bread and butter”), and many others. Jackendoff (1997) estimates the number of MWEs in a speaker’s lexicon to be comparable to the number of single words. However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (cf. Sag et al., 2002). For instance, some MWEs are fixed and allow no internal variation, such as “ad hoc”, while others allow different degrees of internal variability and modification, such as “spill beans” (“spill several/musical/mountains of beans”).

In the second part of the talk I focus on methods for the automatic acquisition of MWEs for robust grammar engineering. First I investigate the hypothesis that MWEs, regardless of their type, can be detected by the distinct statistical properties of their component words, and I compare various statistical measures, a comparison that leads to some interesting conclusions. I then investigate the influence of the size and quality of different corpora, using the BNC and the Web search engines Google and Yahoo. I conclude that, in terms of language usage, Web-generated corpora are fairly similar to more carefully built corpora like the BNC, indicating that their lack of control and balance is probably compensated for by their size.
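The statistical measures compared in the talk are not named in this abstract. Purely as an illustration of the general idea, the sketch below scores adjacent word pairs with pointwise mutual information (PMI), one commonly used association measure. The function name `bigram_pmi` and the toy sentence are assumptions made for this example and are not taken from the talk.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair in tokens.

    Illustrative sketch only: PMI is one possible association measure,
    not necessarily one of those evaluated in the talk.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bigrams            # observed probability of the pair
        p_w1 = unigrams[w1] / n_unigrams      # probability of the first word
        p_w2 = unigrams[w2] / n_unigrams      # probability of the second word
        # PMI compares the observed pair probability with the probability
        # expected if the two words occurred independently of each other.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy corpus (hypothetical); real experiments would use a corpus such as the BNC.
tokens = ("the frying pan is hot and the frying pan is heavy "
          "and the dog is hot").split()
scores = bigram_pmi(tokens)

for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(" ".join(pair), round(score, 2))
```

On this toy input the candidate “frying pan” scores higher than the incidental pair “the frying”. Pairs seen only once can score just as high, though, because PMI tends to overrate rare events, which is one reason the size of the corpus matters for measures of this kind.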
Finally, I show a qualitative evaluation of the results of automatically adding extracted MWEs to existing linguistic resources. To this end, I first discuss two main approaches commonly employed in NLP for treating MWEs. The words-with-spaces approach models an MWE as a single lexical entry and can adequately capture fixed MWEs like “by and large”. Compositional approaches treat MWEs by general, compositional methods of linguistic analysis and are able to capture more syntactically flexible MWEs, like “rock boat”. Such flexible MWEs cannot be satisfactorily captured by a words-with-spaces approach, since this would require lexical entries to be added for all the possible variations of an MWE (e.g., “rock/rocks/rocking this/that/his… boat”). On this basis, I argue that the automatic addition of extracted MWEs to existing linguistic resources improves qualitatively if a more compositional approach to automated grammar/lexicon extension is adopted.

This talk is part of the NLIP Seminar Series.