Presented at OSDI 2021.
Authors: Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, Eric P. Xing
Code (AdaptDL): https://github.com/petuum/adaptdl
This paper presents a deep learning cluster scheduler named Pollux, which co-adaptively allocates resources (the number of GPUs) and tunes the hyperparameters (the batch size and learning rate) for all DL training jobs in a shared cluster.
- The running time of each training iteration can be divided into two main components (a sketch combining them follows this list):
  - $$\text{T}_{\text{grad}}$$: the time spent computing the gradient.
  - $$\text{T}_{\text{sync}}$$: the time spent synchronizing across all GPUs.
    - Collective all-reduce => average gradients
    - Parameter servers => synchronize weights
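A minimal sketch of how these two components could combine into a per-iteration time model, following the paper's observation that gradient computation and synchronization partially overlap. The specific `gamma` parameterization below is an assumption for illustration, not the paper's fitted model:

```python
def iter_time(t_grad: float, t_sync: float, accum_steps: int = 1,
              gamma: float = 2.0) -> float:
    """Per-iteration time under gradient accumulation: the first
    (accum_steps - 1) local steps need no synchronization; in the final
    step, computation and all-reduce partially overlap (gamma = 1 means
    no overlap; larger gamma means more overlap)."""
    local = (accum_steps - 1) * t_grad  # accumulation-only steps
    final = (t_grad ** gamma + t_sync ** gamma) ** (1.0 / gamma)
    return local + final
```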
- GNS (gradient noise scale): the noise-to-signal ratio of the stochastic gradient
  - A larger GNS => a larger batch size or learning rate can be used with less reduction in statistical efficiency
  - Varies greatly between different DL models
  - Non-constant; gradually increases during training (a sketch of its effect on efficiency follows this list)
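A sketch of how the GNS translates into statistical efficiency, following the McCandlish et al. gradient noise scale analysis that the paper builds on (the paper's exact formulation may differ in details):

```python
def statistical_efficiency(gns: float, batch_size: float,
                           base_batch_size: float) -> float:
    """Training progress per example at `batch_size`, relative to
    `base_batch_size`. With noise scale phi, one unit of progress costs
    roughly (B + phi) examples at batch size B, so
    efficiency(M) = (phi + M0) / (phi + M), in (0, 1] for M >= M0."""
    return (gns + base_batch_size) / (gns + batch_size)
```

A large GNS keeps efficiency close to 1 even at large total batch sizes, which is why larger batch sizes pay off more later in training as the GNS grows.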
- Existing DL schedulers
- Non-scale-adaptive: Tiresias, Gandiva => require users to specify the number of GPUs
  - Scale-adaptive: Optimus, SLAQ, Gavel, AntMan, Themis
  - Different numbers of GPUs and different stages of training => different best batch sizes, e.g., larger batch sizes become more useful later in training as the GNS grows
- Propose a formulation of goodput for DL jobs, which combines system throughput with model statistical efficiency (sketched in code after the list below).
- Focus on three configuration parameters
- The number of GPUs
- Per-GPU batch size
- Number of gradient accumulation steps
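A sketch of goodput in terms of these three parameters, reusing `iter_time` and `statistical_efficiency` from the sketches above; in the real system, `t_grad`, `t_sync`, and `gns` would be measured and fitted online rather than passed in:

```python
def goodput(num_gpus: int, per_gpu_batch_size: int, accum_steps: int,
            t_grad: float, t_sync: float, gns: float,
            base_batch_size: int) -> float:
    """Goodput = system throughput (examples/sec) scaled by statistical
    efficiency at the resulting total batch size."""
    total_batch = num_gpus * per_gpu_batch_size * accum_steps
    throughput = total_batch / iter_time(t_grad, t_sync, accum_steps)
    return throughput * statistical_efficiency(gns, total_batch,
                                               base_batch_size)
```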
- Adaptively co-optimizes these inter-dependent factors at two levels: 1) per-job (batch size and gradient accumulation for a given allocation); 2) cluster-wide (GPU allocations across all jobs). See the sketch below.
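A toy sketch of the two levels. In the paper, PolluxAgent tunes each job's configuration given its allocation, while PolluxSched searches allocations with a genetic algorithm; the greedy loop, the `candidates`/`perf` inputs, and the `job.goodput(n)` helper below are all assumptions for illustration:

```python
from typing import Dict, List, Tuple

def best_config(num_gpus: int, candidates: List[Tuple[int, int]],
                perf: dict) -> Tuple[int, int]:
    """Per-job level (PolluxAgent): for a fixed GPU allocation, pick the
    (per-GPU batch size, accumulation steps) pair with the best goodput.
    `perf` holds the measured t_grad, t_sync, gns, base_batch_size."""
    return max(candidates,
               key=lambda ms: goodput(num_gpus, ms[0], ms[1], **perf))

def allocate(jobs: list, total_gpus: int) -> dict:
    """Cluster-wide level (PolluxSched): hand out GPUs one at a time to
    the job with the largest marginal goodput gain. A greedy stand-in
    for the paper's genetic-algorithm search; `job.goodput(n)` is a
    hypothetical helper returning the job's best goodput on n GPUs."""
    alloc = {job: 0 for job in jobs}
    for _ in range(total_gpus):
        job = max(jobs, key=lambda j: j.goodput(alloc[j] + 1)
                                      - j.goodput(alloc[j]))
        alloc[job] += 1
    return alloc
```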
Compared to SOTA DL schedulers, Pollux:
- reduces average job completion time (JCT)
- promotes fairness among DL jobs