Deep Web Data: Analysis, Extraction, and Modelling

If you have a question about this talk, please contact Microsoft Research Cambridge Talks Admins.

Abstract: The traditional way for Web search engines to retrieve and index data from the Web has been to crawl its hyperlink structure. This approach cannot capture data of the deep Web (also known as the hidden Web or invisible Web), the huge amount of content available on the Web that lies behind Web forms or Web services. This talk discusses automatic and unsupervised methods for analyzing, extracting, and modelling Web data, given some initial domain of interest. Strong emphasis will be placed on presenting applied and theoretical open problems whose solution would greatly help in understanding data of the deep Web. We first introduce classical methods for matching Web forms with concepts from an ontology, and investigate how static analysis of JavaScript programs could be used to improve the understanding of an HTML form. We next present an unsupervised approach to information extraction over deep Web result pages and highlight its limitations, stressing in particular the need for a probabilistic representation of the extracted data. This leads us to consider models for probabilistic trees. After a quick survey of the literature on probabilistic XML, we will discuss interesting verification questions, in particular connecting the notion of probabilistic database with that of probabilistic schema.

Biography: Dr. Pierre Senellart is an Associate Professor in the Computer Science and Networking department at Télécom ParisTech, the leading French engineering school specializing in information technology. He is an alumnus of the École normale supérieure and obtained his M.Sc. (2003) and his Ph.D. (2007) in Computer Science from Université Paris-Sud, studying under the supervision of Serge Abiteboul. Pierre Senellart has published articles in internationally renowned conferences and journals (PODS, AAAI, VLDB Journal, Journal of the ACM, etc.). He has been a member of the program committee of ECML/PKDD, WWW, VLDB, and ICDE, a member of the repeatability committee of SIGMOD, and the organizer of the SIGMOD 2010 programming contest. He is also the Information Director of the Journal of the ACM. His research interests focus on theoretical aspects of database management systems and the World Wide Web, and more specifically on the intensional indexing of the deep Web, probabilistic XML databases, and graph mining. He also has an interest in natural language processing, and has been collaborating with SYSTRAN, the leading machine translation company.

This talk is part of the Microsoft Research Cambridge, public talks series.
