University of Cambridge > Talks.cam > Computer Laboratory Systems Research Group Seminar > Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management

Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management

Add to your list(s) Download to your calendar using vCal

If you have a question about this talk, please contact Eiko Yoneki.

As users of “big data” applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the “pay-as-you-go” model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs—systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results.

Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primi- tives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the checkpointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures.

Joint work with Raul Castro Fernandez, Matteo Migliavacca and Peter Pietzuch.

Bio: Eva Kalyvianaki is a lecturer in City University London in the Department of Computer Science. She holds a PhD from Cambridge University and MSc and BSc degrees from the University of Crete, Greece. Before joining City University she was a post-doctoral researcher in Imperial College London. Her research interests span the areas of cloud computing, real-time query processing, autonomic computing and systems and performance in general.

This talk is part of the Computer Laboratory Systems Research Group Seminar series.

Tell a friend about this talk:

This talk is included in these lists:

Note that ex-directory lists are not shown.

 

© 2006-2025 Talks.cam, University of Cambridge. Contact Us | Help and Documentation | Privacy and Publicity