Design decisions in web corpus construction and their impact on distributional semantic models

If you have a question about this talk, please contact Tamara Polajnar.

The World Wide Web has become increasingly popular as a source of linguistic data, not only within the NLP communities but also among theoretical linguists facing problems of data sparseness or data diversity. However, for a variety of reasons, commercial search engines are not normally suitable for collecting data for linguistic research (see, for example, Kilgarriff 2007). An obvious alternative to ‘Googleology’ is to build static corpora from web documents, possibly adding layers of linguistic annotation, and to query these corpora with tools geared towards the needs of linguists. However, constructing a relatively ‘clean’ corpus of web texts from HTML documents usually involves a range of design decisions (e.g., concerning sampling strategy, filtering, de-duplication, and normalization). The impact of such decisions on the characteristics of the final corpus has received relatively little attention so far. This talk focuses on the processing steps that have been applied in building most of the large web corpora available today, such as the WaCky corpora (Baroni et al. 2009) and the COW corpora (Schäfer and Bildhauer 2012). I will discuss to what extent these steps involve arbitrary decisions and show how some of these can be avoided (or at least shifted from the corpus builders to the corpus users). Finally, I tentatively explore the impact of such decisions on distributional semantic models based on the resulting corpora.


Baroni, M., Bernardini, S., Ferraresi, A., and Zanchetta, E. 2009. The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation 43(3), 209-226.

Kilgarriff, A. 2007. Googleology is Bad Science. Computational Linguistics 33(1), 147-151.

Schäfer, R. and Bildhauer, F. 2012. Building Large Corpora from the Web Using a New Efficient Tool Chain. In Nicoletta Calzolari et al. (eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). European Language Resources Association, 486-493.

This talk is part of the NLIP Seminar Series.

