Mitigating the Risks of Metastable Failures in Distributed Systems

Aleksey Charapko, University of New Hampshire
Friday 28 March 2025, 15:30-16:30
Computer Lab, FW11 and Online (Teams link will appear before the talk).

If you have a question about this talk, please contact Richard Mortier.

Metastable failures are a class of catastrophic system failures that cause a permanent, self-sustaining overload of the impacted system. Two characteristics distinguish them: an initial trigger that temporarily overloads the system, and a sustaining effect that kicks in because of that overload and keeps the system overloaded even after the initial trigger is fixed. Once in this permanently overloaded state, called the metastable failure state, the system is perpetually busy but unable to complete any useful work until drastic manual measures, such as restarting the system, are taken. Metastable failures have led to several prominent cloud outages in recent years.
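The trigger and sustaining effect described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions and is not taken from the talk: the function `simulate`, its parameters, and the goodput degradation rule (useful work falls to `capacity * capacity / demand` once demand exceeds capacity, standing in for work wasted on requests that have already timed out) are all hypothetical choices made for illustration. Failed requests are assumed to be retried on the next tick.

```python
def simulate(ticks=50, base_load=80.0, capacity=100.0,
             trigger_start=5, trigger_len=1, trigger_extra=0.0):
    """Toy model: return the goodput (useful work) completed on each tick.

    All names and the degradation rule are illustrative assumptions,
    not a model of any real system.
    """
    backlog = 0.0  # failed requests that clients retry on the next tick
    history = []
    for t in range(ticks):
        in_trigger = trigger_start <= t < trigger_start + trigger_len
        extra = trigger_extra if in_trigger else 0.0
        demand = base_load + backlog + extra
        if demand <= capacity:
            goodput = demand  # everything offered is served
        else:
            # Assumed sustaining effect: under overload, part of the
            # server's work is wasted, so goodput falls as demand grows.
            goodput = capacity * capacity / demand
        backlog = demand - goodput  # unserved requests are retried
        history.append(goodput)
    return history


# A small trigger: the system is briefly overloaded, then drains its backlog.
recovered = simulate(trigger_extra=30.0)

# A larger trigger of the same duration: retries keep demand above the
# point where goodput can cover the base load, so the overload persists
# after the trigger ends.
stuck = simulate(trigger_extra=100.0)

print("final goodput after small trigger:", recovered[-1])
print("final goodput after large trigger:", stuck[-1])
```

In this sketch the base load (80) is well below capacity (100), yet a one-tick spike is enough to leave the system stuck: once demand passes 125, goodput drops below the base load, the retry backlog grows every tick, and the system never recovers without outside intervention such as shedding the backlog.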

This seminar explores strategies for mitigating the risks of metastable failures in distributed systems. First, we focus on the practical robustness of algorithms and systems, accounting for the performance cost of fault tolerance and error handling. Then, we look at the importance of identifying and protecting vulnerable components in large distributed systems to tame the sustaining effects and prevent the sustaining mechanisms from developing into a positive feedback loop. Finally, we discuss “metastable failure poisoning”—a feedback mechanism that spreads the failure across seemingly isolated systems or components.

Bio: Aleksey Charapko is an assistant professor at the University of New Hampshire. He received his Ph.D. from the University at Buffalo, working on consensus algorithms and state machine replication. Now, Aleksey is broadly interested in distributed systems’ performance, reliability, and efficiency. Aleksey has received several awards and research grants, most recently an NSF CAREER award for the “metastable failures” research. In addition to his academic endeavors, Aleksey has over a decade of engineering experience ranging from freelance to big tech to consulting.

This talk is part of the Computer Laboratory Systems Research Group Seminar series.
