Dynamic Causal Monitoring for Distributed Systems

If you have a question about this talk, please contact Andrew Rice.

Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems are complex, varied, and unpredictable. The de facto monitoring and diagnosis tools at our disposal today (logs, counters, and metrics) have two important limitations: what gets recorded is defined a priori, and the information is recorded in a component- or machine-centric way, making it hard to correlate events that cross these boundaries.

In this talk I will describe Pivot Tracing, a monitoring framework for distributed systems that addresses both limitations by combining dynamic instrumentation with causal tracing to fundamentally increase the power of both. Through a novel relational operator, the happened-before join, Pivot Tracing gives users the ability to define arbitrary metrics at runtime at one point of the system, while selecting, filtering, and grouping by events that are meaningful at other parts of the system, even when these cross component or machine boundaries.
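
To make the idea of a happened-before join concrete, here is a minimal Python sketch. It is not the Pivot Tracing API: the names `Baggage`, `client_tracepoint`, `disk_read_tracepoint`, and `run_example` are hypothetical, and everything runs in a single process. It assumes each request carries a small context object between components; an upstream tracepoint records a tuple in it, and a downstream tracepoint joins its own event only with tuples that causally preceded it in the same request.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Baggage:
    """Hypothetical per-request context that travels with the request."""
    tuples: list = field(default_factory=list)


def client_tracepoint(baggage, client_name):
    # Upstream tracepoint: record which client issued this request.
    baggage.tuples.append({"client": client_name})


def disk_read_tracepoint(baggage, bytes_read, results):
    # Downstream tracepoint: join this event with every tuple recorded
    # earlier in the same request, then aggregate bytes per client.
    for earlier in baggage.tuples:
        results[earlier["client"]] += bytes_read


def run_example():
    results = defaultdict(int)
    # Three independent requests (illustrative data), each with its own baggage.
    requests = [("clientA", [100, 50]), ("clientB", [200]), ("clientA", [25])]
    for client, reads in requests:
        baggage = Baggage()
        client_tracepoint(baggage, client)
        for n in reads:
            disk_read_tracepoint(baggage, n, results)
    return dict(results)


if __name__ == "__main__":
    print(run_example())
```

The disk-level metric (bytes read) ends up grouped by an attribute that is only known at the client-facing tier, which is the kind of cross-boundary query the abstract describes. A real system would also have to serialize the context across process and machine boundaries and install the tracepoints dynamically; this sketch omits both.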

I will describe our prototype of Pivot Tracing for Java-based systems, and show some examples of our evaluation on a heterogeneous Hadoop cluster comprising HDFS, HBase, MapReduce, and YARN. We found that Pivot Tracing can effectively identify a diverse range of root causes such as software bugs, misconfiguration, and limping hardware. Further, Pivot Tracing is dynamic, extensible, and enables cross-tier analysis between any inter-operating applications, with low execution overhead.

Bio: Rodrigo Fonseca is an assistant professor at Brown University’s Computer Science Department. He holds a PhD from UC Berkeley, and prior to Brown was a visiting researcher at Yahoo! Research. He is broadly interested in networking, distributed systems, and operating systems. His research involves seeking better ways to build, operate, and diagnose distributed systems, including large-scale internet systems, cloud computing, and mobile computing. He is currently working on dynamic tracing infrastructures for these systems, on new ways to leverage network programmability, and on better ways to manage energy usage in mobile devices.

This talk is part of the Computer Laboratory Digital Technology Group (DTG) Meetings series.
