A Scalable Approach for Managing Unstructured Information

If you have a question about this talk, please contact Eiko Yoneki.

Digital data is being generated in mind-boggling amounts: 15 petabytes, more than eight times the information contained in all US libraries, is created daily. The data landscape is shifting: in addition to structured data in databases, organizations increasingly deal with unstructured data such as email, documents, spreadsheets, blogs, Web pages and media files. Unstructured information makes up 80% of most organizations’ information today and is growing at an annual rate of 60%. Users are demanding more sophisticated information processing from storage and information management systems. Beyond the traditional challenges of storing the bytes and searching and classifying the content, they need to leverage their information to provide relevant and timely insights that improve the outcomes of the tasks they undertake.

In this talk, I will describe recent work at HP Labs on unstructured information management, including SCAN-lite, an extensible framework for gathering structured metadata from unstructured documents, and LazyBase, a scalable database system for ingesting, storing and querying the resulting metadata. Leveraging the high degree of replication present in the enterprise, SCAN-lite uses a two-phase scanning policy (e.g., an initial phase to identify duplicate content and a second phase to do more complicated analysis) that considers client priority classes and idle time to minimize the impact on client foreground workloads. LazyBase is a scalable NoSQL database system that provides extremely high ingest rates, a strong consistency model (as contrasted with eventual consistency), and an explicit per-query tradeoff between freshness and query speed.
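
As a rough illustration of the two-phase idea described above, the following Python sketch groups documents by a content hash in a first pass, then runs a costlier analysis once per group of duplicates in a second pass, and only when the owning client is reported idle. This is not SCAN-lite's actual design or API: the function names, the SHA-256 choice, the stand-in metadata extractor, and the `client_is_idle` callback are all assumptions made for the example, and client priority classes are left out for brevity.

```python
import hashlib
from collections import defaultdict


def phase_one_find_duplicates(documents):
    """Phase 1 (cheap): group document paths by a hash of their content."""
    groups = defaultdict(list)
    for path, content in documents.items():
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(path)
    return groups


def extract_metadata(content):
    """Hypothetical stand-in for the costlier analysis: size and word count."""
    return {"size_bytes": len(content), "word_count": len(content.split())}


def phase_two_analyze(documents, groups, client_is_idle):
    """Phase 2 (expensive): analyze one representative per duplicate group.

    client_is_idle is an assumed callback; a group is skipped (deferred)
    when the client holding its representative copy is busy.
    """
    metadata = {}
    for paths in groups.values():
        representative = paths[0]
        if not client_is_idle(representative):
            continue  # defer to avoid disturbing foreground work
        result = extract_metadata(documents[representative])
        for path in paths:
            metadata[path] = result  # duplicates share one analysis result
    return metadata


# Illustrative data only: two identical files and one distinct file.
documents = {
    "a.txt": b"quarterly report draft",
    "b.txt": b"quarterly report draft",
    "c.txt": b"meeting notes",
}
groups = phase_one_find_duplicates(documents)
metadata = phase_two_analyze(documents, groups, lambda path: True)
```

In this sketch the expensive step runs twice for three files, because the two identical copies share one result; the more replication there is, the larger the saving.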

Bio: Dr. Kimberly Keeton is a Principal Researcher in the Storage and Information Management Platform group at HP Labs in Palo Alto, CA, USA. Her research focuses on simplifying the management of enterprise information systems, including system design and implementation, modeling, and optimization techniques to automatically design systems that meet users’ goals (e.g., dependability or information quality).

This talk is part of the Computer Laboratory Systems Research Group Seminar series.
