
Peloton: Resource Scheduling at Uber

5 min read

This post is a summary of the paper:
Peloton: Uber’s Unified Resource Scheduler for Diverse Cluster Workloads, Cai et al., 2018

Emergence of Peloton

Cluster management deals with aggregating physical hosts into a shared resource pool on which jobs or services can be scheduled. The simplest setup would have a dedicated pool of hosts for each type of workload. For instance, if there are two types of workloads, batch jobs and long-running services, there can be two resource pools: one that runs batch jobs exclusively and another that hosts services.

Such a setup, although simple, does not utilize the clusters to their full extent. At Uber, vast fluctuations in rideshare demand due to events like holidays mean that each cluster must be scaled, and therefore over-provisioned, for its own peak workload.

To better utilize resources, diverse workloads need to be co-located on the same cluster, controlled by a unified compute platform. Enter Peloton, a unified scheduler designed to manage resources across diverse workloads, effectively combining many clusters into one.

How does co-locating diverse workloads on shared clusters help?

There are four types of compute cluster workloads at Uber:

  • Stateless jobs
  • Stateful jobs
  • Batch jobs
  • Daemon jobs

Both stateless and stateful jobs are long-running services that are usually non-preemptible and latency sensitive. Batch jobs, on the other hand, can be preempted and are less sensitive to short-term performance fluctuations. Daemon jobs are infrastructure agents running on each host.

Uber achieves high cluster utilization through resource over-commitment and job preemption. Preempting latency-sensitive online jobs (like those serving a rideshare request) is expensive and degrades the rider experience. The key is to co-locate low-priority, preemptible batch jobs on the same cluster so that over-committed resources can be better utilized.
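As a toy illustration of this co-location policy, the sketch below frees capacity for an online spike by preempting batch tasks, lowest priority first (the task names, fields and numbers are hypothetical, not Peloton's actual API):

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    cpus: float
    preemptible: bool   # batch jobs are preemptible, online services are not
    priority: int       # lower value = lower priority

def free_capacity(capacity, running, needed):
    """Preempt the lowest-priority preemptible tasks until `needed` CPUs
    are free. Returns the names of the preempted tasks."""
    used = sum(t.cpus for t in running)
    preempted = []
    for t in sorted((t for t in running if t.preemptible),
                    key=lambda t: t.priority):
        if capacity - used >= needed:
            break
        running.remove(t)
        used -= t.cpus
        preempted.append(t.name)
    return preempted

# An online spike needing 6 CPUs evicts the two batch tasks, never the service:
running = [Task("api", 4, False, 100),
           Task("etl-1", 4, True, 1),
           Task("etl-2", 4, True, 2)]
print(free_capacity(12, running, 6))   # ['etl-1', 'etl-2']
```

Because online services are marked non-preemptible, they are never candidates for eviction; over-committed capacity is reclaimed entirely from the batch tier.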

In an active-active architecture, excess capacity is reserved for disaster recovery. This capacity can be used for batch jobs until an outage occurs.

Similarly, since Uber buys hardware for peak workloads, most of the cluster is under-utilized during off-peak hours. During high-traffic peak events, capacity can be lent from batch jobs to online workloads.

Peloton Architecture Overview

Peloton uses Apache Mesos for resource aggregation and task execution. Tasks are submitted to Mesos as Docker containers. Four daemon types form the core components of Peloton:

  • Job Manager
  • Resource Manager
  • Placement Engine
  • Host Manager

Each of these components uses Apache Zookeeper for service discovery and leader election.

The Host Manager abstracts away Mesos-specific details from the other Peloton components.

The Resource Manager and Placement Engine sit on top of the Host Manager. The Resource Manager maintains the resource pool hierarchy and calculates resource entitlement for each pool. The Placement Engine maps tasks to hosts, taking into account job and task constraints as well as host attributes.
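A first-fit version of that placement step might look like the following (a simplified sketch with a made-up host/task schema; the real Placement Engine also batches placements and scores candidate hosts):

```python
def place(task, hosts):
    """Return the name of the first host with enough free resources whose
    attributes satisfy all of the task's constraints, or None."""
    for host in hosts:
        # resource check: the host must fit the task's CPU and memory ask
        if host["free_cpus"] < task["cpus"] or host["free_mem_mb"] < task["mem_mb"]:
            continue
        # constraint check: every required attribute must match exactly
        if all(host["attributes"].get(k) == v
               for k, v in task.get("constraints", {}).items()):
            return host["name"]
    return None

hosts = [
    {"name": "h1", "free_cpus": 2, "free_mem_mb": 4096,  "attributes": {"zone": "a"}},
    {"name": "h2", "free_cpus": 8, "free_mem_mb": 16384, "attributes": {"zone": "b"}},
]
task = {"cpus": 4, "mem_mb": 8192, "constraints": {"zone": "b"}}
print(place(task, hosts))   # h2
```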

The Job Manager is responsible for the lifecycles of jobs, tasks and volumes.

Apart from these core components, the Peloton system consists of:

  • Peloton API that uses Protocol Buffers as the interface definition language and YARPC as its RPC runtime
  • Peloton UI and Peloton CLI that are built on top of Peloton API
  • Storage Gateway that provides an abstraction layer for Peloton to operate on different storage backends
  • Group Membership that manages Peloton master instances

Scalability

The design goal of Peloton is to support millions of active tasks, 50,000 hosts, and a throughput of 1,000 tasks per second for scheduling decisions and task launches.

Both the Job Manager and Resource Manager scale horizontally. The Resource Manager keeps all tasks in memory, which makes it possible to sustain a high throughput of scheduling and launching tasks.

To support a cluster size of 50,000 machines, Peloton manages multiple Mesos clusters, instantiating a set of Host Managers for each one. Mesos is known to scale well to about 45,000 hosts only when running about 200,000 long-running, low-churn tasks, whereas the workload at Uber consists of highly dynamic batch jobs. By managing multiple Mesos clusters, Peloton keeps each cluster at a size Mesos handles well for this kind of workload, while the clusters collectively contain about 50,000 machines.
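As a back-of-envelope sketch of that sizing decision (the 45,000-host figure comes from the paragraph above; the split logic is an illustration, not Peloton's actual sharding code):

```python
import math

MAX_HOSTS_PER_CLUSTER = 45_000   # rough Mesos scaling limit noted above

def plan_clusters(total_hosts, max_per_cluster=MAX_HOSTS_PER_CLUSTER):
    """Split hosts evenly across the minimum number of Mesos clusters that
    keeps each one under the per-cluster limit; Peloton would run one set
    of Host Managers per resulting cluster."""
    n = math.ceil(total_hosts / max_per_cluster)
    base, extra = divmod(total_hosts, n)
    return [base + 1 if i < extra else base for i in range(n)]

print(plan_clusters(50_000))   # [25000, 25000]
```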

Elastic Resource Management

Peloton models a subset of resources in a cluster as resource pools. Resource pools are hierarchical in nature which means that a resource pool can contain several child resource pools. This allows Peloton to divide resources among organizations and teams within an organization.

Resource sharing among pools is elastic in nature—resource pools with high demand can borrow resources from other pools if they are not using those resources.

Peloton uses hierarchical max-min fairness for resource management.
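Max-min fairness at a single level of the hierarchy can be sketched as a water-filling loop (a toy model; the actual Resource Manager applies this recursively down the pool tree and per resource dimension):

```python
def max_min_fair(capacity, demands):
    """Repeatedly split the remaining capacity evenly among unsatisfied
    pools; a pool needing less than its share keeps only its demand, and
    the surplus is redistributed in the next round."""
    alloc = {p: 0.0 for p in demands}
    unsatisfied = set(demands)
    remaining = capacity
    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)
        for p in list(unsatisfied):
            grant = min(share, demands[p] - alloc[p])
            alloc[p] += grant
            remaining -= grant
            if demands[p] - alloc[p] <= 1e-9:
                unsatisfied.discard(p)
    return alloc

# 10 CPUs over demands 2/5/8: the small pool is fully satisfied and the
# rest is split evenly between the two larger pools (a→2, b→4, c→4).
print(max_min_fair(10, {"a": 2, "b": 5, "c": 8}))
```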

Each organization within Uber gets a fixed guarantee of resources. Within an organizational boundary, job priority is enforced.

There are different resource dimensions, such as CPU, GPU, memory and disk size, to which resource controls like reservation, limit, share and entitlement can be applied. While sharing resources among resource pools enables efficient use of the clusters, it carries the risk that a lending pool cannot satisfy its own compute needs when it wants its resources back.

Peloton uses preemption to ensure SLA guarantees. Inter-resource pool preemption enforces max-min fairness across all resource pools. Intra-resource pool preemption frees up resources within a pool by preempting low-priority jobs.
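Intra-resource-pool preemption can be sketched as picking victims among a pool's lowest-priority tasks until its usage fits back under its entitlement (an illustrative model with made-up fields, not Peloton's actual victim-selection code):

```python
def tasks_to_preempt(tasks, usage, entitlement):
    """Return victim task names, lowest priority first, until the pool's
    usage no longer exceeds its entitlement."""
    victims = []
    for t in sorted(tasks, key=lambda t: t["priority"]):  # lowest first
        if usage <= entitlement:
            break
        victims.append(t["name"])
        usage -= t["cpus"]
    return victims

tasks = [
    {"name": "svc",   "priority": 10, "cpus": 4},
    {"name": "batch", "priority": 1,  "cpus": 4},
    {"name": "etl",   "priority": 5,  "cpus": 4},
]
# A pool using 12 CPUs against an entitlement of 6 sheds its two
# lowest-priority tasks:
print(tasks_to_preempt(tasks, usage=12, entitlement=6))   # ['batch', 'etl']
```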