
Towards Grey Fault Tolerant Cloud Systems


If you have a question about this talk, please contact Amjad.

Building robust, large-scale distributed systems is notoriously challenging. Decades of research have made significant advances in tackling this challenge with mature techniques such as state-machine replication. These techniques usually assume a fail-stop model. Ample real-world evidence, however, suggests that faults in modern cloud infrastructure are often “grey”, in which a component is severely impaired but still appears to be working. These grey failures cannot be effectively detected or handled by existing solutions.

In this talk, I will discuss the grey failure problem. Using real-world examples, I will argue that a key trait of the subtle grey failure mode is a form of differential observability. Based on this insight, I will present Panorama, a solution that harnesses observability in large systems to detect grey failures by using instrumentation to convert any system component into an in-situ observer. To further enhance the inherent system observability, I will propose an intrinsic software watchdog abstraction and a tool called OmegaGen that automatically generates customized watchdogs for a given program by using a program reduction technique. I will conclude by outlining some open challenges in making cloud systems grey-fault-tolerant.


Ryan Huang is an Assistant Professor in the Department of Computer Science at Johns Hopkins University. He leads the Ordered Systems Lab at JHU, which conducts research broadly in distributed systems, operating systems, and cloud and mobile computing. His work received best paper awards at OSDI 2016, ASPLOS 2019, and NSDI 2020, and was a best paper award nominee at MICRO 2018. He is a recipient of the NSF CAREER Award (2020). Dr. Huang received a B.S. degree in Computer Science (Economics minor) from Peking University (2010) and a Ph.D. degree from UC San Diego (2016).

This talk is part of the Computer Laboratory Systems Research Group Seminar series.
