{% hint style="info" %} I am actively maintaining this list. {% endhint %}
- CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters (NSDI 2024) [Paper]
- MIT & UT-Austin
- Consider the communication patterns of different jobs when placing them on shared network links (toy placement sketch below).
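A minimal sketch of the idea, assuming each job alternates between a compute phase and a communication phase within every iteration; the `Job` fields, the numeric overlap estimate, and the greedy link choice are illustrative stand-ins rather than CASSINI's actual formulation.

```python
# Network-aware placement sketch: prefer the link whose resident jobs'
# communication bursts collide least with the incoming job's bursts.
# All fields and the scoring rule are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    period: float         # iteration time (s)
    comm_start: float     # offset of the communication phase within an iteration (s)
    comm_duration: float  # length of the communication phase (s)

def overlap(a: Job, b: Job, horizon: float = 60.0, step: float = 0.01) -> float:
    """Crude numeric estimate of how long both jobs communicate at the same time."""
    t, total = 0.0, 0.0
    while t < horizon:
        in_a = a.comm_start <= (t % a.period) < a.comm_start + a.comm_duration
        in_b = b.comm_start <= (t % b.period) < b.comm_start + b.comm_duration
        if in_a and in_b:
            total += step
        t += step
    return total

def place(job: Job, links: dict[str, list[Job]]) -> str:
    """Pick the link whose resident jobs' communication phases collide least."""
    best = min(links, key=lambda l: sum(overlap(job, other) for other in links[l]))
    links[best].append(job)
    return best

links = {"link0": [Job("resnet", 0.30, 0.20, 0.08)], "link1": [Job("bert", 0.50, 0.10, 0.25)]}
print(place(Job("vgg", 0.40, 0.05, 0.10), links))
```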
- Blox: A Modular Toolkit for Deep Learning Schedulers (EuroSys 2024) [arXiv] [Code]
- UW-Madison & MSR
- Interference-aware Multiplexing for Deep Learning in GPU Clusters: A Middleware Approach (SC 2023) [Personal Notes] [Paper] [Code]
- UMacau & SIAT, CAS
- IADeep: a cluster scheduler that co-locates DL training tasks
- Tune training configurations (e.g., batch size) across all co-located tasks; choose appropriate tasks to multiplex on a GPU device; account for PCIe bandwidth contention (toy co-location sketch below)
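A toy version of the co-location decision, under the assumption of a simple additive interference model over GPU compute and PCIe bandwidth; the `Task` fields, the 16 GB/s PCIe budget, and the batch-size halving rule are illustrative, not IADeep's tuning algorithm.

```python
# Pick which pending task to multiplex onto a GPU so that predicted
# interference (compute contention plus PCIe contention) is lowest,
# then shrink the newcomer's batch size if contention remains.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    sm_util: float    # fraction of GPU compute the task uses when running alone (0..1)
    pcie_gbps: float  # host-to-device traffic the task generates when running alone
    batch_size: int

PCIE_CAPACITY_GBPS = 16.0  # assumed per-GPU PCIe budget (illustrative)

def predicted_slowdown(resident: Task, candidate: Task) -> float:
    """Crude additive model: contention grows as combined demand exceeds capacity."""
    compute_pressure = max(0.0, resident.sm_util + candidate.sm_util - 1.0)
    pcie_pressure = max(0.0, (resident.pcie_gbps + candidate.pcie_gbps) / PCIE_CAPACITY_GBPS - 1.0)
    return 1.0 + compute_pressure + pcie_pressure

def pick_partner(resident: Task, pending: list[Task]) -> Task:
    """Multiplex the pending task with the smallest predicted slowdown."""
    best = min(pending, key=lambda t: predicted_slowdown(resident, t))
    if predicted_slowdown(resident, best) > 1.0:
        # IADeep tunes training configurations of co-located tasks;
        # the halving rule here is a stand-in for that tuning step.
        best.batch_size = max(1, best.batch_size // 2)
    return best

resident = Task("resnet", sm_util=0.6, pcie_gbps=6.0, batch_size=128)
pending = [Task("bert", 0.7, 4.0, 32), Task("dlrm", 0.3, 10.0, 512)]
print(pick_partner(resident, pending))
```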
- Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling (SOSP 2023) [Paper]
- CMU & Cornell & Petuum Inc.
- Lyra: Elastic Scheduling for Deep Learning Clusters (EuroSys 2023) [Personal Notes] [Paper] [arXiv]
- ByteDance & CityU & CUHK
- Loan idle inference GPU servers to elastic training jobs.
- Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning (NSDI 2023) [Personal Notes] [Paper] [Code]
- UW-Madison & UT-Austin
- Elastic resource requirements; extend market theory.
- Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs (ASPLOS 2023) [Personal Notes] [Paper] [Code]
- NTU & Shanghai AI Lab & SenseTime
- Scheduling interpretability
- Multi-Resource Interleaving for Deep Learning Training (SIGCOMM 2022) [Personal Notes] [Paper] [Code]
- PKU & ByteDance
- Muri: Pack jobs along multiple resource types (GPU, CPU, network, storage I/O) in the time dimension (toy interleaving sketch below)
- Integrate with PyTorch
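A toy model of the interleaving benefit, assuming each job's iteration time splits cleanly into GPU, CPU, network, and I/O phases and that packed jobs pipeline perfectly; the greedy pairing below is a simplification of the paper's matching-based grouping, and all numbers are made up.

```python
# Estimate how much packing two jobs together helps when their per-iteration
# phases can be interleaved across different resources.

from itertools import combinations

RESOURCES = ("gpu", "cpu", "net", "io")

# Per-iteration time (seconds) spent on each resource, per job (illustrative numbers).
jobs = {
    "vision": {"gpu": 0.06, "cpu": 0.02, "net": 0.01, "io": 0.03},
    "nlp":    {"gpu": 0.02, "cpu": 0.01, "net": 0.06, "io": 0.01},
    "recsys": {"gpu": 0.01, "cpu": 0.06, "net": 0.02, "io": 0.02},
    "speech": {"gpu": 0.03, "cpu": 0.02, "net": 0.02, "io": 0.05},
}

def interleaved_time(group: tuple[str, ...]) -> float:
    """With perfect pipelining, the group's iteration time is bounded by the
    busiest single resource rather than by the sum of all phases."""
    return max(sum(jobs[j][r] for j in group) for r in RESOURCES)

def sequential_time(group: tuple[str, ...]) -> float:
    """Plain GPU time-sharing without interleaving: iterations run back to back."""
    return sum(sum(jobs[j].values()) for j in group)

def best_pairing() -> list[tuple[str, ...]]:
    """Greedy: repeatedly pack the pair with the largest interleaving benefit."""
    remaining, groups = set(jobs), []
    while len(remaining) > 1:
        pair = max(combinations(sorted(remaining), 2),
                   key=lambda p: sequential_time(p) - interleaved_time(p))
        groups.append(pair)
        remaining -= set(pair)
    groups.extend((j,) for j in remaining)
    return groups

for g in best_pairing():
    print(g, f"interleaved={interleaved_time(g):.2f}s", f"sequential={sequential_time(g):.2f}s")
```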
- Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads (arXiv 2202.07848) [Personal Notes] [Paper]
- Microsoft
- Live GPU job migration
- Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters (OSDI 2022) [Personal Notes] [Paper] [Code]
- MSR & UT-Austin & VMware Research
- Consider the allocation of CPU and memory resources.
- Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning (OSDI 2021) [Personal Notes] [Paper] [Code]
- Petuum & CMU
- Best Paper Award
- Co-adaptively allocates resources (number of GPUs) and tunes the hyperparameters (batch size and learning rate) for all DL training jobs to maximize cluster-wide goodput (toy goodput model below).
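A toy goodput model in the same spirit: goodput is system throughput (examples per second) times statistical efficiency (progress per example), and the scheduler searches over GPU count and batch size. Every functional form and constant below is an illustrative assumption; Pollux fits its throughput and efficiency models online from observed metrics.

```python
# Goodput = throughput(num_gpus, batch_size) * statistical_efficiency(batch_size).
# A brute-force search over a small grid stands in for Pollux's optimizer.

import itertools

def throughput(num_gpus: int, batch_size: int, per_gpu_rate: float = 500.0,
               scaling_loss: float = 0.9) -> float:
    """Examples/s with sub-linear scaling in GPU count (illustrative model)."""
    return per_gpu_rate * (num_gpus ** scaling_loss) * min(1.0, batch_size / (32 * num_gpus))

def statistical_efficiency(batch_size: int, gradient_noise_scale: float = 600.0) -> float:
    """Larger batches give diminishing progress per example (illustrative model,
    loosely inspired by the gradient noise scale)."""
    return (gradient_noise_scale + 32) / (gradient_noise_scale + batch_size)

def goodput(num_gpus: int, batch_size: int) -> float:
    return throughput(num_gpus, batch_size) * statistical_efficiency(batch_size)

def best_config(max_gpus: int = 16) -> tuple[int, int, float]:
    """Co-adapt GPU count and batch size for one job by exhaustive search."""
    candidates = itertools.product(range(1, max_gpus + 1), [32, 64, 128, 256, 512, 1024])
    g, m = max(candidates, key=lambda c: goodput(*c))
    return g, m, goodput(g, m)

print(best_config())
```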
- MAPA: Multi-Accelerator Pattern Allocation Policy for Multi-Tenant GPU Servers (SC 2021) [Paper] [Code]
- UC Riverside & Pacific Northwest National Lab & USydney
- Consider multi-GPU accelerator topologies such as single/double NVLink.
- Astraea: A Fair Deep Learning Scheduler for Multi-Tenant GPU Clusters (TPDS 2021) [Paper]
- PKU & NTU & SenseTime
- Long-term GPU-time fairness
- AntMan: Dynamic Scaling on GPU Clusters for Deep Learning (OSDI 2020) [Paper] [Code]
- Alibaba
- Co-locate resource-guarantee and best-effort jobs (toy two-tier admission sketch below).
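A minimal two-tier admission sketch, assuming guaranteed jobs always reclaim capacity by preempting best-effort jobs; the `Cluster` bookkeeping below is illustrative and ignores AntMan's framework-level GPU memory and compute sharing.

```python
# Resource-guarantee jobs always get their GPUs; best-effort jobs only fill
# idle capacity and are evicted when a guaranteed job needs the GPUs back.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpus: int
    guaranteed: bool

class Cluster:
    def __init__(self, total_gpus: int):
        self.total = total_gpus
        self.running: list[Job] = []

    def used(self) -> int:
        return sum(j.gpus for j in self.running)

    def submit(self, job: Job) -> bool:
        if job.guaranteed:
            # Evict best-effort jobs until the guaranteed job fits.
            while self.used() + job.gpus > self.total:
                victim = next((j for j in self.running if not j.guaranteed), None)
                if victim is None:
                    return False  # cluster genuinely full of guaranteed jobs
                self.running.remove(victim)
                print(f"preempted best-effort job {victim.name}")
            self.running.append(job)
            return True
        # Best-effort jobs only use idle capacity.
        if self.used() + job.gpus <= self.total:
            self.running.append(job)
            return True
        return False

c = Cluster(total_gpus=8)
c.submit(Job("prod-training", 4, guaranteed=True))
c.submit(Job("debug-run", 4, guaranteed=False))
c.submit(Job("prod-training-2", 4, guaranteed=True))  # forces preemption of debug-run
print([j.name for j in c.running])
```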
- HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees (OSDI 2020) [Personal Notes] [Paper] [Code]
- MSRA
- Virtual private clusters; resource isolation and management for multi-tenant clusters.
- Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads (OSDI 2020) [Paper] [Code]
- MSR & Stanford
- Gavel: Consider performance heterogeneity across multiple accelerator types by expressing scheduling policies as optimization problems over an allocation matrix (minimal cvxpy sketch below).
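A minimal instance of the allocation-matrix formulation using cvxpy, solving a max-min fairness policy over heterogeneity-normalized effective throughput; the throughput numbers and cluster sizes are made up, and Gavel's round-based time slicing on top of the computed allocation is omitted.

```python
# X[j, k] is the fraction of time job j spends on accelerator type k.
# Maximize the minimum normalized effective throughput across jobs.

import cvxpy as cp
import numpy as np

# Measured throughput (iterations/s) of each job on each accelerator type.
throughputs = np.array([
    [40.0, 15.0, 5.0],   # job 0 on [A100, V100, K80]
    [12.0, 10.0, 8.0],   # job 1
    [30.0,  5.0, 1.0],   # job 2
])
num_gpus_per_type = np.array([1, 2, 3])
num_jobs, num_types = throughputs.shape

X = cp.Variable((num_jobs, num_types), nonneg=True)
effective = cp.sum(cp.multiply(throughputs, X), axis=1)
# Normalize by what each job would get on an exclusive fastest GPU.
normalized = cp.multiply(effective, 1.0 / throughputs.max(axis=1))

constraints = [
    cp.sum(X, axis=1) <= 1,                  # each job runs at most 100% of the time
    cp.sum(X, axis=0) <= num_gpus_per_type,  # don't oversubscribe any GPU type
]
problem = cp.Problem(cp.Maximize(cp.min(normalized)), constraints)
problem.solve()
print(np.round(X.value, 2))
```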
- Themis: Fair and Efficient GPU Cluster Scheduling (EuroSys 2020) [Paper]
- UW-Madison & MSR
- Long-term fairness
- AlloX: Compute Allocation in Hybrid Clusters (EuroSys 2020) [Paper] [Code]
- Stony Brook University & SUNY Korea & UMich
- CPU-GPU hybrid clusters; minimize average job completion time via min-cost bipartite matching (small matching example below).
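A small worked example of the matching reduction: assigning a job to the p-th-from-last slot on a device contributes p times its processing time on that device to the total completion time, so a min-cost bipartite matching between jobs and (device, position) slots minimizes average JCT. The processing times below are made up; the real system also estimates them online.

```python
# Min-cost bipartite matching of jobs to (device, position-from-last) slots.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Estimated processing time of each job on each device type (e.g., CPU vs GPU).
proc_time = np.array([
    [10.0, 2.0],   # job 0: slow on CPU, fast on GPU
    [ 4.0, 3.0],   # job 1
    [ 6.0, 5.0],   # job 2
    [ 3.0, 9.0],   # job 3: CPU-friendly
])
num_jobs, num_devices = proc_time.shape

# One column per (device, position-from-last) slot; enough positions for all jobs.
slots = [(k, p) for k in range(num_devices) for p in range(1, num_jobs + 1)]
cost = np.array([[p * proc_time[j, k] for (k, p) in slots] for j in range(num_jobs)])

rows, cols = linear_sum_assignment(cost)  # optimal min-cost matching
for j, s in zip(rows, cols):
    device, position = slots[s]
    print(f"job {j} -> device {device}, {position}-th from the end of its queue")
print("total completion time:", cost[rows, cols].sum())
```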
- Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning (EuroSys 2019) [Paper]
- MSR India
- $$\text{Gandiva}_\text{Fair}$$: Achieve efficiency and fairness despite cluster heterogeneity
- Tiresias: A GPU Cluster Manager for Distributed Deep Learning (NSDI 2019) [Paper] [Code]
- UMich SymbioticLab
- Relax the consolidated placement constraint; only models with skewed tensor sizes need consolidation (placement sketch below)
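A sketch of the placement decision, assuming the scheduler profiles each model's tensor sizes and consolidates only jobs whose message sizes are highly skewed; the skew metric and the threshold below are illustrative choices, not the paper's.

```python
# Consolidate skew-sensitive jobs on one node; scatter everything else.

def needs_consolidation(tensor_sizes_mb: list[float], skew_threshold: float = 0.5) -> bool:
    """Treat a model as skewed if its largest tensor dominates the total message size."""
    return max(tensor_sizes_mb) / sum(tensor_sizes_mb) > skew_threshold

def place(job_gpus: int, tensor_sizes_mb: list[float], free_gpus_per_node: list[int]) -> list[int]:
    """Return the node index for each requested GPU, or [] if the job cannot be placed."""
    if needs_consolidation(tensor_sizes_mb):
        # Consolidated: the whole job must fit on a single node.
        for node, free in enumerate(free_gpus_per_node):
            if free >= job_gpus:
                return [node] * job_gpus
        return []
    # Relaxed: scatter GPUs across nodes with free capacity.
    placement = []
    for node, free in enumerate(free_gpus_per_node):
        take = min(free, job_gpus - len(placement))
        placement += [node] * take
        if len(placement) == job_gpus:
            return placement
    return []

# A model with one dominant layer stays consolidated; evenly sized layers can be spread.
print(place(4, [400.0, 20.0, 20.0, 10.0], free_gpus_per_node=[2, 2, 4]))
print(place(4, [30.0, 25.0, 20.0, 25.0], free_gpus_per_node=[2, 2, 4]))
```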
- Gandiva: Introspective Cluster Scheduling for Deep Learning (OSDI 2018) [Paper]
- MSRA
- Hyper-parameter tuning jobs; job packing; migration; grow-shrink; time-slicing.
- Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters (EuroSys 2018) [Paper] [Code]
- HKU & ByteDance
- Minimize JCT based on online resource-performance models (greedy allocation sketch below).
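A sketch of the greedy allocation loop, assuming a simple diminishing-returns speed curve in place of Optimus's fitted parameter-server/worker performance model and its loss-curve-based estimate of remaining work; the numbers are illustrative.

```python
# Repeatedly hand the next GPU to the job whose estimated completion time
# drops the most from receiving it.

def speed(gpus: int) -> float:
    """Training steps/s with diminishing returns (illustrative stand-in model)."""
    return 0.0 if gpus == 0 else 10.0 * gpus / (gpus + 2.0)

def remaining_time(remaining_steps: float, gpus: int) -> float:
    return float("inf") if gpus == 0 else remaining_steps / speed(gpus)

def allocate(jobs: dict[str, float], total_gpus: int) -> dict[str, int]:
    """jobs maps job name -> estimated remaining steps until convergence."""
    alloc = {name: 0 for name in jobs}
    for _ in range(total_gpus):
        # Marginal reduction in estimated completion time from one more GPU.
        gain = {name: remaining_time(steps, alloc[name]) - remaining_time(steps, alloc[name] + 1)
                for name, steps in jobs.items()}
        best = max(gain, key=gain.get)
        alloc[best] += 1
    return alloc

print(allocate({"jobA": 90_000, "jobB": 30_000, "jobC": 10_000}, total_gpus=8))
```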
- Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments (SC 2017) [Paper] [Code]
- Barcelona Supercomputing Center & IBM Watson Research Center
- Consider multiple link technologies such as PCI-e and NVLink.
- SLAQ: Quality-Driven Scheduling for Distributed Machine Learning (SoCC 2017) [Personal Notes] [Paper]
- Princeton
- Fine-grained job-level scheduler
- Leverage the iterative nature of general ML training algorithms: allocate resources toward the largest predicted quality (loss) improvement (toy sketch below)
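A toy version of quality-driven allocation, assuming an exponential loss-curve predictor: shares go, one at a time, to the job whose next share buys the largest predicted loss reduction. The predictor and all numbers are illustrative assumptions, not SLAQ's fitted quality models.

```python
# Give each resource share to the job with the highest predicted loss reduction.

import math

def predicted_loss(job: dict, extra_share: float) -> float:
    """Toy loss curve: loss decays exponentially in cumulative work done."""
    work = job["work_done"] + extra_share
    return job["floor"] + (job["initial"] - job["floor"]) * math.exp(-job["rate"] * work)

def allocate(jobs: dict[str, dict], total_shares: int) -> dict[str, int]:
    alloc = {name: 0 for name in jobs}
    for _ in range(total_shares):
        gain = {name: predicted_loss(j, alloc[name]) - predicted_loss(j, alloc[name] + 1)
                for name, j in jobs.items()}
        best = max(gain, key=gain.get)
        alloc[best] += 1
    return alloc

jobs = {
    "fresh-job":    {"initial": 2.0, "floor": 0.2, "rate": 0.30, "work_done": 0},
    "almost-done":  {"initial": 2.0, "floor": 0.2, "rate": 0.30, "work_done": 20},
    "slow-learner": {"initial": 2.0, "floor": 0.5, "rate": 0.05, "work_done": 0},
}
print(allocate(jobs, total_shares=10))  # most shares go to the fresh, fast-improving job
```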
- MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters (NSDI 2022) [Paper] [Trace]
- HKUST & Alibaba
- GPU sharing traces
- Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters (SC 2021) [Paper] [Trace]
- NTU & SenseTime
- Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (ATC 2019) [Paper] [Trace]
- MSR
- Characterizing Deep Learning Training Workloads on Alibaba-PAI (IISWC 2019) [Paper]
- Alibaba PAI
- Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision (arXiv 2205.11913) [Paper] [Paper List]
- NTU & PKU & SenseTime
- DL: Deep Learning
- ML: Machine Learning