Adapting a WSJ-trained Lexicalized-Grammar Parser to New Domains
- Speaker: Laura Rimell, Oxford University
- Date & Time: Friday 14 November 2008, 12:00 - 13:00
- Venue: SW01, Computer Laboratory
Abstract
In this talk I will describe some experiments on adapting the C&C CCG parser to new domains. The parser was originally developed using CCGbank, the CCG version of the Penn Treebank, and is therefore tuned to newspaper text. The two new domains we consider are (1) biomedical abstracts and (2) questions for a QA system (using the term “domain” somewhat loosely in the latter case).
The porting approach we use is to train the parser at lower levels of representation than full syntactic derivations. The lexicalized nature of CCG (in which words are assigned syntactic categories that include subcategorization information) makes it possible to use a level of representation intermediate between POS tags and full derivations. For the biomedical data, we find that simply retraining the POS tagger leads to a large improvement in performance, and that using annotated data at the intermediate CCG lexical category level improves parsing accuracy further. A similar result is obtained for the question data, but the impact of retraining at the CCG lexical category level is much greater. We suggest that this is because the syntax of questions differs more from that of newspaper text than does the syntax of biomedical sentences, and we discuss some measures supporting this idea.
The parsing accuracies obtained for both biomedical and question data are in the same range as those reported for newspaper text, and higher than those previously reported for the biomedical domain on the same evaluation resource. The conclusion is that porting newspaper-trained parsers to new domains may not be as difficult as first thought (at least for parsers which use lexicalized grammars), but we note that different levels of representation may have different impacts on the porting process, depending on the characteristics of the target domain.
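To make the three levels of representation mentioned in the abstract concrete, here is a minimal Python sketch. The sentence, its annotations, and the names `Token` and `arity` are invented for illustration and are not taken from the talk or from the C&C parser; the categories are written in a simplified CCGbank-like notation.

```python
from dataclasses import dataclass


@dataclass
class Token:
    word: str
    pos: str        # Penn Treebank-style POS tag
    category: str   # CCG lexical category (simplified notation)


def arity(category: str) -> int:
    """Count the arguments a category seeks: one per forward or backward slash."""
    return category.count("/") + category.count("\\")


# Hypothetical biomedical-style sentence, annotated by hand for this sketch.
sentence = [
    Token("The", "DT", "NP/N"),
    Token("protein", "NN", "N"),
    Token("binds", "VBZ", "(S[dcl]\\NP)/NP"),
    Token("DNA", "NN", "N"),
]

for tok in sentence:
    print(f"{tok.word:8} {tok.pos:4} {tok.category:16} args={arity(tok.category)}")
```

The POS tag `VBZ` only says that "binds" is a present-tense verb, whereas the lexical category `(S[dcl]\NP)/NP` also records that it takes an object to its right and a subject to its left. That extra subcategorization information is what places lexical categories between POS tags and full derivations.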
This talk is part of the NLIP Seminar Series.