Design decisions in web corpus construction and their impact on distributional semantic models

NLIP Seminar Series, University of Cambridge (Talks.cam)
If you have a question about this talk, please contact Tamara Polajnar.

The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP community, but also among theoretical linguists facing problems of data sparseness or data diversity. However, for a variety of reasons, commercial search engines are usually unsuitable for collecting data for linguistic research (see, for example, Kilgarriff 2007). An obvious alternative to ‘Googleology’ is to build static corpora from web documents, possibly adding layers of linguistic annotation, and to query these corpora with tools geared towards the needs of linguists. However, constructing a relatively ‘clean’ corpus of web texts from HTML documents usually involves many design decisions (e.g., concerning sampling strategy, filtering, de-duplication, and normalization). The impact of such decisions on the characteristics of the final corpus has received relatively little attention so far.

This talk focuses on the processing steps applied in building most of the large web corpora available today, such as the WaCky corpora (Baroni et al. 2009) and the COW corpora (Schäfer and Bildhauer 2012). I will discuss to what extent these steps involve arbitrary decisions and show how some of them can be avoided (or at least shifted from the corpus builders to the corpus users). Finally, I tentatively explore the impact of such decisions on distributional semantic models built from the resulting corpora.

References

Baroni, M., Bernardini, S., Ferraresi, A., and Zanchetta, E. 2009. The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation 43(3), 209–226.

Kilgarriff, A. 2007. Googleology is Bad Science. Computational Linguistics 33(1), 147–151.

Schäfer, R. and Bildhauer, F. 2012. Building Large Corpora from the Web Using a New Efficient Tool Chain. In Nicoletta Calzolari et al. (eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). European Language Resources Association, 486–493.

This talk is part of the NLIP Seminar Series.
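The normalization and de-duplication steps named in the abstract can be illustrated in miniature. The sketch below (not from the talk; production pipelines such as those behind the WaCky and COW corpora use more sophisticated near-duplicate detection) shows one simple design choice: hashing a normalized form of each document and keeping only the first exact copy.

```python
import hashlib
import unicodedata

def normalize(text: str) -> str:
    """One possible normalization: Unicode NFC, lowercase, collapsed whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.lower().split())

def deduplicate(docs):
    """Keep only the first occurrence of each document, compared after normalization."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["Hello   World", "hello world", "Something else"]
print(deduplicate(docs))  # the second document is a duplicate of the first after normalization
```

Note how even this tiny example embeds arbitrary decisions of exactly the kind the talk discusses: whether casing and whitespace differences count as "the same document" is a choice made by the corpus builder, invisible to the corpus user.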
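The distributional semantic models mentioned in the abstract can likewise be sketched in their simplest count-based form (again an illustration, not the models used in the talk): each word is represented by the counts of words co-occurring within a fixed window, and word similarity is the cosine between these count vectors. The window size and similarity measure here are illustrative choices.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(tokens, window=2):
    """Count, for each word, the words appearing within +/-window positions of it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

tokens = "the cat sat on the mat the dog sat on the rug".split()
vecs = cooccurrence_vectors(tokens)
print(cosine(vecs["cat"], vecs["dog"]))
```

Because the vectors are built directly from corpus counts, any upstream decision — what was filtered out, which duplicates were removed, how text was normalized — changes the counts and therefore the similarities, which is precisely the dependency the talk proposes to examine.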