Mitigating the Risks of Metastable Failures in Distributed Systems
If you have a question about this talk, please contact Richard Mortier.

Metastable failures are a class of catastrophic system failures that cause a permanent, self-sustaining overload of the affected system. Their distinguishing characteristics are an initial trigger that temporarily overloads the system and a sustaining effect that kicks in because of that overload and keeps the system in the overloaded state even after the initial trigger is fixed. Once in this permanently overloaded state, called the metastable failure state, the system is perpetually busy but unable to complete any useful work until drastic manual measures, such as restarting the system, are taken. Metastable failures have led to several prominent cloud outages in recent years.

This seminar explores strategies for mitigating the risks of metastable failures in distributed systems. First, we focus on the practical robustness of algorithms and systems, accounting for the performance cost of fault tolerance and error handling. Then, we look at the importance of identifying and protecting vulnerable components in large distributed systems to tame the sustaining effects and prevent them from developing into a positive feedback loop. Finally, we discuss “metastable failure poisoning”, a feedback mechanism that spreads the failure across seemingly isolated systems or components.

Bio: Aleksey Charapko is an assistant professor at the University of New Hampshire. He received his Ph.D. from the University at Buffalo, working on consensus algorithms and state machine replication. Aleksey is now broadly interested in the performance, reliability, and efficiency of distributed systems. He has received several awards and research grants, most recently an NSF CAREER award for the “metastable failures” research. In addition to his academic work, Aleksey has over a decade of engineering experience ranging from freelance to big tech to consulting.
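As a rough illustration of the trigger and sustaining effect described in the abstract above, here is a toy Python simulation. It is not taken from the talk. It assumes a single FIFO server with a fixed per-tick capacity, clients that abandon a request after a timeout and immediately retry, and a server that cannot tell abandoned requests from live ones. The function `simulate` and all of its parameters are hypothetical names chosen for this sketch.

```python
from collections import Counter, deque


def simulate(ticks=100, capacity=10, base_load=8, timeout=3,
             trigger_start=20, trigger_end=30, trigger_extra=10):
    """Return the useful completions (goodput) per tick of a toy FIFO server.

    Assumptions of this sketch: time is discrete, clients retry exactly once
    per timed-out request, and the server still spends capacity on requests
    whose clients have already given up.
    """
    queue = deque()      # arrival tick of every queued request, FIFO order
    pending = Counter()  # number of still-queued requests per arrival tick
    retries = 0          # retries that will arrive on the next tick
    goodput = []

    for t in range(ticks):
        # The trigger is a temporary load spike; retries add to normal load.
        extra = trigger_extra if trigger_start <= t < trigger_end else 0
        arrivals = base_load + extra + retries
        queue.extend([t] * arrivals)
        pending[t] += arrivals

        # Serve up to `capacity` requests; stale ones are wasted work.
        useful = 0
        for _ in range(min(capacity, len(queue))):
            arrived = queue.popleft()
            pending[arrived] -= 1
            if t - arrived <= timeout:
                useful += 1
        goodput.append(useful)

        # Requests still queued at age == timeout are abandoned by their
        # clients, who retry on the next tick (the sustaining effect).
        retries = pending[t - timeout]

    return goodput


baseline = simulate(trigger_extra=0)  # no trigger: load stays below capacity
overloaded = simulate()               # 10-tick spike, then load returns to normal
print("goodput without trigger, final tick:", baseline[-1])
print("goodput with trigger, final tick:  ", overloaded[-1])
```

In this model the run without a trigger completes the full base load on every tick, while the run with a temporary spike ends with zero goodput long after the spike is over: the backlog makes every served request stale, and each stale request has already produced a retry, so the offered load never falls back below capacity.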
This talk is part of the Computer Laboratory Systems Research Group Seminar series.