
Examining Raft's behaviour during partial network failures


If you have a question about this talk, please contact Srinivasan Keshav.

State machine replication protocols such as Raft are widely used to build highly available, strongly consistent services that maintain liveness even if a minority of servers crash. As these systems are implemented and optimised for production, they accumulate many divergences from the original specification. These divergences are poorly documented, leaving operators with an incomplete model of the system’s characteristics, especially during failures. In this paper, we examine one such Raft model, which was used to explain the November Cloudflare outage, and show that etcd’s behaviour during the same failure differs from it. We then identify the specific optimisations in etcd that cause this difference and present a more complete model of the outage, based on etcd’s behaviour in an emulated deployment using reckon. Finally, we highlight the upcoming PreVote optimisation in etcd, which might have prevented the outage from happening in the first place.

Bio:

Chris Jensen is a first-year PhD student in the SRG, focusing on benchmarking and improving the availability of strongly consistent distributed databases. He previously completed his BSc in Computer Science at the University of Cambridge.

This talk is part of the Computer Laboratory Systems Research Group Seminar series.

