Presented at OSDI 2021.
Authors: Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, Eric P. Xing
Code (AdaptDL): https://github.com/petuum/adaptdl
This paper presents a deep learning cluster scheduler named Pollux, which co-adaptively allocates resources (the number of GPUs) and tunes the hyperparameters (the batch size and learning rate) for all DL training jobs in a shared cluster.
- The running time of each training iteration can be divided into two main components (a sketch combining them follows this list):
  - $$\text{T}_{\text{grad}}$$: the time spent computing the gradient.
  - $$\text{T}_{\text{sync}}$$: the time spent synchronizing across all GPUs.
    - Collective all-reduce => average gradients
    - Parameter servers => synchronize weights
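A minimal sketch of how these two components could combine into a per-iteration time model, following the paper's observation that gradient computation and synchronization partially overlap. The specific `gamma` parameterization below is an assumption for illustration, not the paper's fitted model:

```python
def iter_time(t_grad: float, t_sync: float, accum_steps: int = 1,
              gamma: float = 2.0) -> float:
    """Per-iteration time under gradient accumulation: the first
    (accum_steps - 1) local steps need no synchronization; in the final
    step, computation and all-reduce partially overlap (gamma = 1 means
    no overlap; larger gamma means more overlap)."""
    local = (accum_steps - 1) * t_grad  # accumulation-only steps
    final = (t_grad ** gamma + t_sync ** gamma) ** (1.0 / gamma)
    return local + final
```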
- GNS (gradient noise scale): the noise-to-signal ratio of the stochastic gradient
  - A larger GNS => a larger batch size or learning rate can be used with less reduction in statistical efficiency
  - Varies greatly between different DL models
  - Non-constant; gradually increases during training (a sketch of its effect on efficiency follows this list)
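A sketch of how the GNS translates into statistical efficiency, following the McCandlish et al. gradient noise scale analysis that the paper builds on (the paper's exact formulation may differ in details):

```python
def statistical_efficiency(gns: float, batch_size: float,
                           base_batch_size: float) -> float:
    """Training progress per example at `batch_size`, relative to
    `base_batch_size`. With noise scale phi, one unit of progress costs
    roughly (B + phi) examples at batch size B, so
    efficiency(M) = (phi + M0) / (phi + M), in (0, 1] for M >= M0."""
    return (gns + base_batch_size) / (gns + batch_size)
```

A large GNS keeps efficiency close to 1 even at large total batch sizes, which is why larger batch sizes pay off more later in training as the GNS grows.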
- Existing DL schedulers
- Non-scale-adaptive: Tiresias, Gandiva => require users to specify the number of GPUs
  - Scale-adaptive: Optimus, SLAQ, Gavel, AntMan, Themis
  - Different numbers of GPUs and different stages of training => different best batch sizes, e.g., larger batch sizes become more useful later in training as the GNS grows
- Propose a formulation of goodput for DL jobs, which combines system throughput with model statistical efficiency (sketched in code after the list below).
- Focus on three configuration parameters
- The number of GPUs
- Per-GPU batch size
- Number of gradient accumulation steps
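A sketch of goodput in terms of these three parameters, reusing `iter_time` and `statistical_efficiency` from the sketches above; in the real system, `t_grad`, `t_sync`, and `gns` would be measured and fitted online rather than passed in:

```python
def goodput(num_gpus: int, per_gpu_batch_size: int, accum_steps: int,
            t_grad: float, t_sync: float, gns: float,
            base_batch_size: int) -> float:
    """Goodput = system throughput (examples/sec) scaled by statistical
    efficiency at the resulting total batch size."""
    total_batch = num_gpus * per_gpu_batch_size * accum_steps
    throughput = total_batch / iter_time(t_grad, t_sync, accum_steps)
    return throughput * statistical_efficiency(gns, total_batch,
                                               base_batch_size)
```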
- Adaptively co-optimizes these inter-dependent factors at two levels: 1) per-job (batch size and gradient accumulation for a given allocation); 2) cluster-wide (GPU allocations across all jobs). See the sketch below.
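A toy sketch of the two levels. In the paper, PolluxAgent tunes each job's configuration given its allocation, while PolluxSched searches allocations with a genetic algorithm; the greedy loop, the `candidates`/`perf` inputs, and the `job.goodput(n)` helper below are all assumptions for illustration:

```python
from typing import Dict, List, Tuple

def best_config(num_gpus: int, candidates: List[Tuple[int, int]],
                perf: dict) -> Tuple[int, int]:
    """Per-job level (PolluxAgent): for a fixed GPU allocation, pick the
    (per-GPU batch size, accumulation steps) pair with the best goodput.
    `perf` holds the measured t_grad, t_sync, gns, base_batch_size."""
    return max(candidates,
               key=lambda ms: goodput(num_gpus, ms[0], ms[1], **perf))

def allocate(jobs: list, total_gpus: int) -> dict:
    """Cluster-wide level (PolluxSched): hand out GPUs one at a time to
    the job with the largest marginal goodput gain. A greedy stand-in
    for the paper's genetic-algorithm search; `job.goodput(n)` is a
    hypothetical helper returning the job's best goodput on n GPUs."""
    alloc = {job: 0 for job in jobs}
    for _ in range(total_gpus):
        job = max(jobs, key=lambda j: j.goodput(alloc[j] + 1)
                                      - j.goodput(alloc[j]))
        alloc[job] += 1
    return alloc
```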
Compared to SOTA DL schedulers, Pollux:
- reduces average job completion time (JCT)
- promotes fairness among DL jobs