Resource Scheduler

{% hint style="info" %} I am actively maintaining this list. {% endhint %}

Scheduling for DL Training Workloads

  • CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters (NSDI 2024) [Paper]
    • MIT & UT-Austin
    • Consider the communication pattern of different jobs while placing them on network links.
  • Blox: A Modular Toolkit for Deep Learning Schedulers (EuroSys 2024) [arXiv] [Code]
    • UW-Madison & MSR
  • Interference-aware Multiplexing for Deep Learning in GPU Clusters: A Middleware Approach (SC 2023) [Personal Notes] [Paper] [Code]
    • UMacau & SIAT, CAS
    • IADeep — a cluster scheduler to co-locate DL training tasks
    • Tune training configurations (e.g., batch size) across all co-located tasks; choose appropriate tasks to multiplex on a GPU device; consider PCIe bandwidth
  • Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling (SOSP 2023) [Paper]
    • CMU & Cornell & Petuum Inc.
  • Lyra: Elastic Scheduling for Deep Learning Clusters (EuroSys 2023) [Personal Notes] [Paper] [arXiv]
    • ByteDance & CityU & CUHK
    • Loan idle inference GPU servers to elastic training jobs.
  • Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning (NSDI 2023) [Personal Notes] [Paper] [Code]
    • UW-Madison & UT-Austin
    • Elastic resource requirements; extend market theory.
  • Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs (ASPLOS 2023) [Personal Notes] [Paper] [Code]
    • NTU & Shanghai AI Lab & SenseTime
    • Scheduling interpretability
  • Multi-Resource Interleaving for Deep Learning Training (SIGCOMM 2022) [Personal Notes] [Paper] [Code]
    • PKU & ByteDance
    • Muri: Pack jobs along multiple resource types in the time dimension
    • Integrate with PyTorch
  • Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads (arXiv 2202.07848) [Personal Notes] [Paper]
    • Microsoft
    • Live GPU job migration
  • Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters (OSDI 2022) [Personal Notes] [Paper] [Code]
    • MSR & UT-Austin & VMware Research
    • Consider the allocation of CPU and memory resources.
  • Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning (OSDI 2021) [Personal Notes] [Paper] [Code]
    • Petuum & CMU
    • Best Paper Award
    • Co-adaptively allocate resources (number of GPUs) and tune hyperparameters (batch size and learning rate) for all DL training jobs (see the goodput sketch after this list).
  • MAPA: Multi-Accelerator Pattern Allocation Policy for Multi-Tenant GPU Servers (SC 2021) [Paper] [Code]
    • UC Riverside & Pacific Northwest National Lab & USydney
    • Consider multi-GPU accelerator topologies such as single/double NVLink.
  • Astraea: A Fair Deep Learning Scheduler for Multi-Tenant GPU Clusters (TPDS 2021) [Paper]
    • PKU & NTU & SenseTime
    • Long-term GPU-time fairness
  • AntMan: Dynamic Scaling on GPU Clusters for Deep Learning (OSDI 2020) [Paper] [Code]
    • Alibaba
    • Co-locate resource-guarantee and best-effort jobs.
  • HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees (OSDI 2020) [Personal Notes] [Paper] [Code]
    • MSRA
    • Virtual private clusters; resource isolation and management for multi-tenant clusters.
  • Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads (OSDI 2020) [Paper] [Code]
    • MSR & Stanford
    • Gavel: Consider performance heterogeneity across multiple accelerator types.
  • Themis: Fair and Efficient GPU Cluster Scheduling (EuroSys 2020) [Paper]
    • UW-Madison & MSR
    • Long-term fairness
  • AlloX: Compute Allocation in Hybrid Clusters (EuroSys 2020) [Paper] [Code]
    • Stony Brook University & SUNY Korea & UMich
    • CPU-GPU hybrid clusters; minimize average job completion time via min-cost bipartite matching (see the matching sketch after this list).
  • Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning (EuroSys 2019) [Paper]
    • MSR India
    • $$\text{Gandiva}_\text{Fair}$$: Achieve efficiency and fairness despite cluster heterogeneity
  • Tiresias: A GPU Cluster Manager for Distributed Deep Learning (NSDI 2019) [Paper] [Code]
    • UMich SymbioticLab
    • Relax the consolidated placement constraint.
  • Gandiva: Introspective Cluster Scheduling for Deep Learning (OSDI 2018) [Paper]
    • MSRA
    • Hyper-parameter tuning jobs; job packing; migration; grow-shrink; time-slicing.
  • Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters (EuroSys 2018) [Paper] [Code]
    • HKU & ByteDance
    • Minimize job completion time (JCT) based on online resource-performance models.
  • Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments (SC 2017) [Paper] [Code]
    • Barcelona Supercomputing Center & IBM Watson Research Center
    • Consider multiple link technologies such as PCI-e and NVLink.
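
To make the Pollux entry above concrete, here is a minimal goodput sketch: goodput = system throughput × statistical efficiency, and the scheduler searches for the (GPU count, batch size) pair that maximizes it. The throughput model, the efficiency formula, and every constant below are simplified stand-ins (Pollux fits both terms from live measurements and uses the preconditioned gradient noise scale), so treat this as an illustration of the objective rather than the actual implementation.

```python
"""Minimal sketch of goodput-driven allocation in the spirit of Pollux.
All models and constants are made-up illustrations, not measured values."""

def throughput(num_gpus: int, batch_size: int) -> float:
    # Hypothetical model: per-GPU rate saturates with per-GPU batch size,
    # and adding GPUs pays a mild communication penalty.
    per_gpu_batch = batch_size / num_gpus
    per_gpu_rate = 100.0 * per_gpu_batch / (per_gpu_batch + 32.0)  # examples/sec
    comm_penalty = 1.0 / (1.0 + 0.05 * (num_gpus - 1))
    return num_gpus * per_gpu_rate * comm_penalty

def statistical_efficiency(batch_size: int, base_batch: int = 64,
                           noise_scale: float = 512.0) -> float:
    # Diminishing progress per example once the batch size grows past the
    # gradient noise scale (same shape as Pollux's efficiency term).
    return (noise_scale + base_batch) / (noise_scale + batch_size)

def goodput(num_gpus: int, batch_size: int) -> float:
    return throughput(num_gpus, batch_size) * statistical_efficiency(batch_size)

# Pick the (GPU count, batch size) pair with the highest goodput for one job.
candidates = [(g, b) for g in (1, 2, 4, 8) for b in (64, 128, 256, 512, 1024)]
best = max(candidates, key=lambda c: goodput(*c))
print("best (num_gpus, batch_size):", best, "goodput:", round(goodput(*best), 1))
```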
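
The matching sketch referenced in the AlloX entry is below. It illustrates the classic reduction AlloX builds on: minimizing the sum of job completion times across heterogeneous resources (one CPU pool and one GPU pool here) by matching jobs to (machine, queue-position) slots with a min-cost bipartite matching. The estimated processing times are made up; the real system estimates them online and handles jobs arriving over time.

```python
"""Toy min-cost bipartite matching for a CPU-GPU hybrid cluster,
in the spirit of AlloX. Processing times below are invented."""
import numpy as np
from scipy.optimize import linear_sum_assignment

# est_time[i][m]: estimated processing time of job i on machine m
est_time = np.array([
    [10.0, 2.0],   # job 0: slow on CPU, fast on GPU
    [4.0,  3.0],   # job 1
    [6.0,  5.0],   # job 2
    [3.0,  8.0],   # job 3: CPU-friendly
])
num_jobs, num_machines = est_time.shape

# Each "slot" is (machine m, position k from the end of m's queue).
# A job placed k-th from the end delays itself and the k-1 jobs after it,
# so it contributes k * est_time[i, m] to the sum of completion times.
slots = [(m, k) for m in range(num_machines) for k in range(1, num_jobs + 1)]
cost = np.array([[k * est_time[i, m] for (m, k) in slots]
                 for i in range(num_jobs)])

rows, cols = linear_sum_assignment(cost)  # min-cost bipartite matching
for i, c in zip(rows, cols):
    m, k = slots[c]
    print(f"job {i} -> machine {m}, position {k} from the end "
          f"(cost contribution {cost[i, c]:.1f})")
```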

Scheduling for General ML Training Workloads

  • SLAQ: Quality-Driven Scheduling for Distributed Machine Learning (SoCC 2017) [Personal Notes] [Paper]
    • Princeton
    • Fine-grained, quality-driven job-level scheduler
    • Leverage the iterative nature of general ML training algorithms: allocate resources to the jobs with the largest predicted quality (loss) improvement (see the sketch below).
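
The sketch below is a rough illustration of that quality-driven idea: greedily hand out workers to whichever job is predicted to reduce its loss the most with one more worker. The per-job predictors and job names here are hypothetical toy curves; SLAQ fits such predictions online and normalizes quality metrics across different algorithms.

```python
"""Minimal sketch of SLAQ-style quality-driven allocation.
The loss-drop predictors below are invented diminishing-returns curves."""

def predicted_loss_drop(job: str, workers: int) -> float:
    # Hypothetical per-job curves: loss drop grows sublinearly with workers,
    # and nearly converged jobs (small scale) improve little regardless.
    scale = {"job-a": 1.0, "job-b": 0.4, "job-c": 0.05}[job]
    return scale * (1.0 - 1.0 / (1.0 + workers))

def allocate(jobs, total_workers):
    # Greedily give one worker at a time to the job with the largest
    # predicted marginal loss reduction.
    alloc = {j: 0 for j in jobs}
    for _ in range(total_workers):
        best = max(jobs, key=lambda j: predicted_loss_drop(j, alloc[j] + 1)
                                       - predicted_loss_drop(j, alloc[j]))
        alloc[best] += 1
    return alloc

print(allocate(["job-a", "job-b", "job-c"], total_workers=10))
```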

Trace Analysis

  • MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters (NSDI 2022) [Paper] [Trace]
    • HKUST & Alibaba
    • GPU sharing traces
  • Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters (SC 2021) [Paper] [Trace]
    • NTU & SenseTime
  • Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (ATC 2019) [Paper] [Trace]
    • MSR
  • Characterizing Deep Learning Training Workloads on Alibaba-PAI (IISWC 2019) [Paper]
    • Alibaba PAI

Survey

  • Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision (arXiv 2205.11913) [Paper] [Paper List]
    • NTU & PKU & SenseTime

Acronyms

  • DL: Deep Learning
  • ML: Machine Learning