
Making the Most of Massive Clusters


If you have a question about this talk, please contact Srinivasan Keshav.

Resource management systems play an important role in today’s large clusters, allocating jobs and containers to compute resources while balancing metrics such as fairness, efficiency, and fault tolerance. Existing management policies in systems such as Kubernetes, VMware’s DRS, and Red Hat’s OpenShift rely on heuristic-based schedulers, which often scale well but typically produce sub-optimal allocations. This problem is made worse by the growing trend toward heterogeneous clusters (composed of a mix of several generations of CPUs, GPUs, and other hardware), where existing heuristics perform poorly.

This talk will emphasize the environmental footprint of large resource clusters as a key motivation. I’ll first describe our work on allocating ML training jobs in heterogeneous clusters. A key insight is that many popular scheduling objectives can be cast as mathematical optimization problems whose solutions can maximize cluster efficiency; other systems take a similar approach, for example, TetriSched and Facebook’s RAS. However, optimization-based techniques are notorious for scaling poorly to massive systems. To address this issue, I will describe POP: a technique to partition the problem and quickly approximate the optimal allocation. POP reduces solve times by several orders of magnitude with minimal performance loss across a wide range of problem domains, including cluster scheduling and network traffic engineering.
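To make the partitioning idea concrete, here is a minimal sketch, not the speaker's implementation. It assumes a toy problem in which each job runs on exactly one machine and the objective is total throughput; the names solve_exact, solve_partitioned, and make_instance are hypothetical, and the brute-force solver only stands in for a real optimization solver. The sketch randomly splits jobs and machines into k groups, solves each small problem exactly, and merges the results.

```python
import itertools
import random


def solve_exact(jobs, machines, throughput):
    """Brute-force the one-job-per-machine assignment with maximum total throughput.

    Stand-in for a real optimization solver; cost grows factorially with size.
    """
    if len(jobs) > len(machines):
        raise ValueError("need at least as many machines as jobs")
    best_value, best_assignment = float("-inf"), {}
    for perm in itertools.permutations(machines, len(jobs)):
        value = sum(throughput[j][m] for j, m in zip(jobs, perm))
        if value > best_value:
            best_value, best_assignment = value, dict(zip(jobs, perm))
    return best_value, best_assignment


def solve_partitioned(jobs, machines, throughput, k, seed=0):
    """Approximate the optimum by solving k random subproblems independently."""
    rng = random.Random(seed)
    jobs, machines = list(jobs), list(machines)
    rng.shuffle(jobs)
    rng.shuffle(machines)
    total, assignment = 0, {}
    for i in range(k):
        # Each subproblem gets every k-th job and every k-th machine.
        value, sub = solve_exact(jobs[i::k], machines[i::k], throughput)
        total += value
        assignment.update(sub)
    return total, assignment


def make_instance(n, seed=0):
    """Build a hypothetical heterogeneous cluster: integer throughput per (job, machine)."""
    rng = random.Random(seed)
    jobs = [f"job{i}" for i in range(n)]
    machines = [f"machine{i}" for i in range(n)]
    throughput = {j: {m: rng.randint(1, 10) for m in machines} for j in jobs}
    return jobs, machines, throughput


if __name__ == "__main__":
    jobs, machines, throughput = make_instance(8)
    exact_value, _ = solve_exact(jobs, machines, throughput)
    approx_value, _ = solve_partitioned(jobs, machines, throughput, k=2)
    print("exact:", exact_value, "partitioned:", approx_value)
```

On this toy instance the exact solver enumerates 8! = 40,320 assignments, while the partitioned version enumerates 2 × 4! = 48. The partitioned result can never exceed the exact optimum, since it is a feasible assignment for the full problem; how close it comes depends on the instance and on k.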

Bio: Fiodar is currently a postdoctoral fellow at the Stanford Future Data Systems lab, working with Matei Zaharia and Peter Bailis. His research interests span ML systems, energy systems, and data science, with a focus on finding practical solutions to fundamental problems. He obtained his PhD from the University of Waterloo, where his thesis on the optimization of solar panel and battery systems was recognized with the Cheriton Distinguished Dissertation Award.

This talk is part of the Computer Laboratory Systems Research Group Seminar series.


