BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Mitigating the Risks of Metastable Failures in Distributed Systems
  - Aleksey Charapko\, University of New Hampshire
DTSTART:20250328T153000Z
DTEND:20250328T163000Z
UID:TALK229042@talks.cam.ac.uk
CONTACT:Richard Mortier
DESCRIPTION:Teams link: https://teams.microsoft.com/l/meetu
 p-join/19%3ameeting_NmQ5YzhmYjUtZGZjYy00NGIzLWEzY2QtZWM2NWNmMzg4NTZh%40thr
 ead.v2/0?context=%7b%22Tid%22%3a%2249a50445-bdfa-4b79-ade3-547b4f3986e9%22
 %2c%22Oid%22%3a%22c74ff4ca-98fe-4b28-9889-e119acc12f30%22%7d\n\nMetastable
  failures refer to a class of catastrophic system failures that cause a pe
 rmanent\, self-sustaining overload of the impacted system. Distinguishing 
 characteristics of metastable failures are the initial trigger that tempor
 arily overloads the system and the sustaining effect that kicks in due to 
 such overload and keeps the system in the overloaded state\, even after t
 he initial trigger is fixed. Once in this permanently overloaded state\, c
 alled the metastable failure state\, the system is perpetually busy but un
 able to complete any useful work until drastic manual measures\, such as r
 estarting the system\, are taken. Metastable failures have led to several 
 prominent cloud outages in recent years.\n\nThis seminar explores strategi
 es for mitigating the risks of metastable failures in distributed systems.
  First\, we focus on the practical robustness of algorithms and systems\, 
 accounting for the performance cost of fault tolerance and error handling.
  Then\, we look at the importance of identifying and protecting vulnerable
  components in large distributed systems to tame the sustaining effects an
 d prevent the sustaining mechanisms from developing into a positive feedba
 ck loop. Finally\, we discuss "metastable failure poisoning" -- a feedback
  mechanism that spreads the failure across seemingly isolated systems or c
 omponents.\n\nBio: Aleksey Charapko is an assistant professor at the Unive
 rsity of New Hampshire. He received his Ph.D. from the University at Buffa
 lo\, working on consensus algorithms and state machine replication. Now\, 
 Aleksey is broadly interested in distributed systems' performance\, reliab
 ility\, and efficiency. Aleksey has received several awards and research g
 rants\, most recently an NSF CAREER award for the "metastable failures" re
 search. In addition to his academic endeavors\, Aleksey has over a decade 
 of engineering experience ranging from freelance to big tech to consulting
 .
LOCATION:Computer Lab\, FW11 and Online (Teams link will appear before the
  talk)
END:VEVENT
END:VCALENDAR
