Deep Web Data: Analysis, Extraction, and Modelling

Prof. Pierre Senellart - Telecom Paris Tech
Monday 26 April 2010, 11:00-12:00
Lecture-room large (126 seats) Microsoft Research Ltd, Roger Needham Building, 7 J J Thomson Avenue (Off Madingley Road), CB3 0FB.

If you have a question about this talk, please contact Microsoft Research Cambridge Talks Admins.

Abstract: The traditional way for Web search engines to retrieve and index data from the Web has been to crawl its hyperlink structure. This approach cannot capture data of the deep Web (also known as the hidden Web or invisible Web), the huge amount of content available on the Web that lies behind Web forms or Web services. This talk discusses automatic and unsupervised methods for analyzing, extracting, and modelling Web data, given some initial domain of interest. Strong emphasis will be placed on presenting applied and theoretical open problems whose solution would greatly help in understanding data of the deep Web. We first introduce classical methods for matching Web forms with concepts from an ontology, and investigate how static analysis of JavaScript programs could be used to improve the understanding of an HTML form. We next present an unsupervised approach to information extraction over deep Web result pages and highlight its limitations, insisting in particular on the need for a probabilistic representation of the extracted data. This leads us to consider models for probabilistic trees. After a quick survey of the literature on probabilistic XML, we will discuss interesting questions on verification aspects, in particular connecting the notion of probabilistic database with that of probabilistic schema.

Biography: Dr. Pierre Senellart is an Associate Professor in the Computer Science and Networking department at Télécom ParisTech, the leading French engineering school specialized in information technology. He is an alumnus of the École normale supérieure and obtained his M.Sc. (2003) and his Ph.D. (2007) in Computer Science from Université Paris-Sud, studying under the supervision of Serge Abiteboul. Pierre Senellart has published articles in internationally renowned conferences and journals (PODS, AAAI, VLDB Journal, Journal of the ACM, etc.). He has been a member of the program committees of ECML/PKDD, WWW, VLDB, and ICDE, a member of the repeatability committee of SIGMOD, and the organizer of the SIGMOD 2010 programming contest. He is also the Information Director of the Journal of the ACM. His research interests focus on theoretical aspects of database management systems and the World Wide Web, and more specifically on the intensional indexing of the deep Web, probabilistic XML databases, and graph mining. He also has an interest in natural language processing, and has been collaborating with SYSTRAN, the leading machine translation company.

This talk is part of the Microsoft Research Cambridge, public talks series.
