Ray: A Distributed Framework for Emerging AI Applications Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica, UC Berkeley
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen and Thierry Moreau, University of Washington; Ziheng Jiang, University of Washington, AWS; Lianmin Zheng, Shanghai Jiao Tong University; Eddie Yan, Haichen Shen, and Meghan Cowan, University of Washington; Leyuan Wang, UC Davis, AWS; Yuwei Hu, Cornell; Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy, University of Washington
Gandiva: Introspective Cluster Scheduling for Deep Learning Wencong Xiao, Beihang University & Microsoft Research; Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, and Nipun Kwatra, Microsoft Research; Zhenhua Han, The University of Hong Kong & Microsoft Research; Pratyush Patel, Microsoft Research; Xuan Peng, Huazhong University of Science and Technology & Microsoft Research; Hanyu Zhao, Peking University & Microsoft Research; Quanlu Zhang, Fan Yang, and Lidong Zhou, Microsoft Research
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems Yunseong Lee, Seoul National University; Alberto Scolari, Politecnico di Milano; Byung-Gon Chun, Seoul National University; Marco Domenico Santambrogio, Politecnico di Milano; Markus Weimer and Matteo Interlandi, Microsoft
Serving DNNs like Clockwork: Performance Predictability from the Bottom Up Arpan Gujarati, Max Planck Institute for Software Systems; Reza Karimi, Emory University; Safya Alzayat, Wei Hao, and Antoine Kaufmann, Max Planck Institute for Software Systems; Ymir Vigfusson, Emory University; Jonathan Mace, Max Planck Institute for Software Systems
A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters Yimin Jiang, Tsinghua University and ByteDance; Yibo Zhu, ByteDance; Chang Lan, Google; Bairen Yi, ByteDance; Yong Cui, Tsinghua University; Chuanxiong Guo, ByteDance
Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads Deepak Narayanan and Keshav Santhanam, Stanford University and Microsoft Research; Fiodar Kazhamiaka, Stanford University; Amar Phanishayee, Microsoft Research; Matei Zaharia, Stanford University
PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications Zhihao Bai and Zhen Zhang, Johns Hopkins University; Yibo Zhu, ByteDance Inc.; Xin Jin, Johns Hopkins University
HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees Hanyu Zhao, Peking University and Microsoft; Zhenhua Han, The University of Hong Kong and Microsoft; Zhi Yang, Peking University; Quanlu Zhang, Fan Yang, Lidong Zhou, and Mao Yang, Microsoft; Francis C.M. Lau, The University of Hong Kong; Yuqi Wang, Yifan Xiong, and Bin Wang, Microsoft
AntMan: Dynamic Scaling on GPU Clusters for Deep Learning Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia, Alibaba Group
Twine: A Unified Cluster Management System for Shared Infrastructure Chunqiang Tang, Kenny Yu, Kaushik Veeraraghavan, Jonathan Kaldor, Scott Michelson, Thawan Kooburat, Aravind Anbudurai, Matthew Clark, Kabir Gogia, Long Cheng, Ben Christensen, Alex Gartrell, Maxim Khutornenko, Sachin Kulkarni, Marcin Pawlowski, Tuomas Pelkonen, Andre Rodrigues, Rounak Tibrewal, Vaishnavi Venkatesan, and Peter Zhang, Facebook Inc.
Building Scalable and Flexible Cluster Managers Using Declarative Programming Lalith Suresh, VMware; João Loff, IST (ULisboa) / INESC-ID; Faria Kalim, UIUC; Sangeetha Abdu Jyothi, UC Irvine and VMware; Nina Narodytska, Leonid Ryzhyk, Sahan Gamage, Brian Oki, Pranshu Jain, and Michael Gasch, VMware
Ansor: Generating High-Performance Tensor Programs for Deep Learning Lianmin Zheng, UC Berkeley; Chengfan Jia, Minmin Sun, and Zhao Wu, Alibaba Group; Cody Hao Yu, Amazon Web Services; Ameer Haj-Ali, UC Berkeley; Yida Wang, Amazon Web Services; Jun Yang, Alibaba Group; Danyang Zhuo, UC Berkeley and Duke University; Koushik Sen, Joseph E. Gonzalez, and Ion Stoica, UC Berkeley
Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks Lingxiao Ma, Peking University and Microsoft Research; Zhiqiang Xie, ShanghaiTech University and Microsoft Research; Zhi Yang, Peking University; Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou, Microsoft Research
A Tensor Compiler for Unified Machine Learning Prediction Serving Supun Nakandala, UC San Diego; Karla Saur, Microsoft; Gyeong-In Yu, Seoul National University; Konstantinos Karanasos, Carlo Curino, Markus Weimer, and Matteo Interlandi, Microsoft
Retiarii: A Deep Learning Exploratory-Training Framework Quanlu Zhang, Zhenhua Han, Fan Yang, Yuge Zhang, Zhe Liu, Mao Yang, and Lidong Zhou, Microsoft Research
KungFu: Making Training in Distributed Machine Learning Adaptive Luo Mai, Guo Li, Marcel Wagenländer, Konstantinos Fertakis, Andrei-Octavian Brabete, and Peter Pietzuch, Imperial College London
Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning Aurick Qiao, Petuum, Inc. and Carnegie Mellon University; Sang Keun Choe and Suhas Jayaram Subramanya, Carnegie Mellon University; Willie Neiswanger, Petuum, Inc. and Carnegie Mellon University; Qirong Ho, Petuum, Inc.; Hao Zhang, Petuum, Inc. and UC Berkeley; Gregory R. Ganger, Carnegie Mellon University; Eric P. Xing, MBZUAI, Petuum, Inc., and Carnegie Mellon University
Oort: Efficient Federated Learning via Guided Participant Selection Fan Lai, Xiangfeng Zhu, Harsha V. Madhyastha, and Mosharaf Chowdhury, University of Michigan
PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections Haojie Wang, Jidong Zhai, Mingyu Gao, Zixuan Ma, Shizhi Tang, and Liyan Zheng, Tsinghua University; Yuanzhi Li, Carnegie Mellon University; Kaiyuan Rong and Yuanyong Chen, Tsinghua University; Zhihao Jia, Carnegie Mellon University and Facebook
Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads John Thorpe, Yifan Qiao, Jonathan Eyolfson, and Shen Teng, UCLA; Guanzhou Hu, UCLA and University of Wisconsin, Madison; Zhihao Jia, CMU; Jinliang Wei, Google Brain; Keval Vora, Simon Fraser University; Ravi Netravali, Princeton University; Miryung Kim and Guoqing Harry Xu, UCLA
GNNAdvisor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs Yuke Wang, Boyuan Feng, Gushu Li, Shuangchen Li, Lei Deng, Yuan Xie, and Yufei Ding, University of California, Santa Barbara
Marius: Learning Massive Graph Embeddings on a Single Machine Jason Mohoney and Roger Waleffe, University of Wisconsin–Madison; Henry Xu, University of Maryland, College Park; Theodoros Rekatsinas and Shivaram Venkataraman, University of Wisconsin–Madison
P3: Distributed Deep Graph Learning at Scale Swapnil Gandhi and Anand Padmanabha Iyer, Microsoft Research
SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute Ningxin Zheng, Microsoft Research; Bin Lin, Microsoft Research and Tsinghua University; Quanlu Zhang, Lingxiao Ma, Yuqing Yang, Fan Yang, Yang Wang, Mao Yang, and Lidong Zhou, Microsoft Research
ROLLER: Fast and Efficient Tensor Compilation for Deep Learning Hongyu Zhu, University of Toronto and Microsoft Research; Ruofan Wu, Renmin University of China and Microsoft Research; Yijia Diao, Shanghai Jiao Tong University and Microsoft Research; Shanbin Ke, UCSD and Microsoft Research; Haoyu Li, Columbia University and Microsoft Research; Chen Zhang, Tsinghua University and Microsoft Research; Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, and Lidong Zhou, Microsoft Research; Asaf Cidon, Columbia University; Gennady Pekhimenko, University of Toronto
Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning Chengfei Lv, Zhejiang University and Alibaba Group; Chaoyue Niu, Shanghai Jiao Tong University and Alibaba Group; Renjie Gu, Xiaotang Jiang, Zhaode Wang, Bin Liu, Ziqi Wu, Qiulin Yao, Congyu Huang, Panos Huang, Tao Huang, Hui Shu, Jinde Song, Bin Zou, Peng Lan, and Guohuan Xu, Alibaba Group; Fei Wu, Zhejiang University; Shaojie Tang, University of Texas at Dallas; Fan Wu and Guihai Chen, Shanghai Jiao Tong University
Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization Colin Unger, Stanford University; Zhihao Jia, Carnegie Mellon University and Meta; Wei Wu, Los Alamos National Laboratory and NVIDIA; Sina Lin, Microsoft; Mandeep Baines and Carlos Efrain Quintero Narvaez, Meta; Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, and Jamaludin Mohd-Yusof, Los Alamos National Laboratory; Xi Luo, SLAC National Accelerator Laboratory; Dheevatsa Mudigere, Jongsoo Park, and Misha Smelyanskiy, Meta; Alex Aiken, Stanford University
Orca: A Distributed Serving System for Transformer-Based Generative Models Gyeong-In Yu and Joo Seong Jeong, Seoul National University; Geon-Woo Kim, FriendliAI and Seoul National University; Soojeong Kim, FriendliAI; Byung-Gon Chun, FriendliAI and Seoul National University
Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences Mingcong Han, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Shanghai AI Laboratory; Hanze Zhang, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China; Rong Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Shanghai AI Laboratory; Haibo Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University; Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning Lianmin Zheng, Zhuohan Li, and Hao Zhang, UC Berkeley; Yonghao Zhuang, Shanghai Jiao Tong University; Zhifeng Chen and Yanping Huang, Google; Yida Wang, Amazon Web Services; Yuanzhong Xu, Google; Danyang Zhuo, Duke University; Eric P. Xing, MBZUAI and Carnegie Mellon University; Joseph E. Gonzalez and Ion Stoica, UC Berkeley
Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters Jayashree Mohan, Amar Phanishayee, and Janardhan Kulkarni, Microsoft Research; Vijay Chidambaram, University of Texas at Austin and VMware Research
Ekko: A Large-Scale Deep Learning Recommender System with Low-Latency Model Update Chijun Sima, Tencent; Yao Fu and Man-Kit Sit, The University of Edinburgh; Liyi Guo, Xuri Gong, Feng Lin, Junyu Wu, Yongsheng Li, and Haidong Rong, Tencent; Pierre-Louis Aublin, IIJ research laboratory; Luo Mai, The University of Edinburgh
Efficient and Scalable Graph Pattern Mining on GPUs Xuhao Chen and Arvind, MIT CSAIL
AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving Zhuohan Li and Lianmin Zheng, UC Berkeley; Yinmin Zhong, Peking University; Vincent Liu, University of Pennsylvania; Ying Sheng, Stanford University; Xin Jin, Peking University; Yanping Huang and Zhifeng Chen, Google; Hao Zhang, Joseph E. Gonzalez, and Ion Stoica, UC Berkeley
Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning Chen Zhang, Tsinghua University; Lingxiao Ma and Jilong Xue, Microsoft Research; Yining Shi, Peking University & Microsoft Research; Ziming Miao, Microsoft; Fan Yang, Microsoft Research Asia; Jidong Zhai, Tsinghua University; Zhi Yang, Peking University; Mao Yang, Microsoft Research
Welder: Scheduling Deep Learning Memory Access via Tile-graph Yining Shi, Peking University & Microsoft Research; Zhi Yang, Peking University; Jilong Xue, Lingxiao Ma, Yuqing Xia, Ziming Miao, Yuxiao Guo, Fan Yang, and Lidong Zhou, Microsoft Research
Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators Jie Zhao, State Key Laboratory of Mathematical Engineering and Advanced Computing; Lianmin Zheng, UC Berkeley; Siyuan Feng, Shanghai Jiao Tong University; Chen Tian, ACM; Xiaoqiang Dan, Fei Liu, Chengke Wang, Sheng Yuan, Wenyuan Lv, and Qikai Xie, Stream Computing Inc.
EinNet: Optimizing Tensor Programs with Derivation-Based Transformations Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, and Shuhong Huang, Tsinghua University; Xupeng Miao, Carnegie Mellon University; Shizhi Tang and Kezhao Huang, Tsinghua University; Zhihao Jia, Carnegie Mellon University
Hydro: Surrogate-Based Hyperparameter Tuning Service in the Datacenter Qinghao Hu, Tianwei Zhang, Yonggang Wen, and Meng Zhang, Nanyang Technological University; Peng Sun, SenseTime; Zhisheng Ye, Peking University; Qiaoling Chen, National University of Singapore
MGG: Accelerating Graph Neural Networks with Fine-grained Intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms Yuke Wang, Boyuan Feng, and Zheng Wang, University of California, Santa Barbara; Tong Geng, Kevin Barker, and Ang Li, Pacific Northwest National Laboratory; Yufei Ding, University of California, Santa Barbara
Optimizing Dynamic Neural Networks with Brainstorm Weihao Cui, Shanghai Jiao Tong University; Zhenhua Han, Microsoft Research; Lingji Ouyang, USTC; Yichuan Wang, Shanghai Jiao Tong University; Ningxin Zheng, Lingxiao Ma, Yuqing Yang, Fan Yang, and Jilong Xue, Microsoft Research; Lili Qiu, UT Austin, MSR Asia Shanghai; Lidong Zhou, Microsoft Research; Quan Chen, Shanghai Jiao Tong University; Haisheng Tan, University of Science and Technology of China; Minyi Guo, Shanghai Jiao Tong University
AdaEmbed: Adaptive Embedding for Large-Scale Recommendation Models Fan Lai, University of Michigan; Wei Zhang, Rui Liu, William Tsai, Xiaohan Wei, Yuxi Hu, Sabin Devkota, Jianyu Huang, Jongsoo Park, Xing Liu, Zeliang Chen, Ellie Wen, Paul Rivera, Jie You, and Jason Chen, Meta Inc.; Mosharaf Chowdhury, University of Michigan
PipeDream: Generalized Pipeline Parallelism for DNN Training Deepak Narayanan (Stanford University), Aaron Harlap (Carnegie Mellon University), Amar Phanishayee (Microsoft Research), Vivek Seshadri (Microsoft Research), Nikhil R. Devanur (Microsoft Research), Gregory R. Ganger (Carnegie Mellon University), Phillip B. Gibbons (Carnegie Mellon University), Matei Zaharia (Stanford University)
A Generic Communication Scheduler for Distributed DNN Training Acceleration Yanghua Peng (The University of Hong Kong), Yibo Zhu (ByteDance Inc.), Yangrui Chen (The University of Hong Kong), Yixin Bao (The University of Hong Kong), Bairen Yi (ByteDance Inc.), Chang Lan (ByteDance Inc.), Chuan Wu (The University of Hong Kong), Chuanxiong Guo (ByteDance Inc.)
Parity Models: Erasure-Coded Resilience for Prediction Serving Systems Jack Kosaian (Carnegie Mellon University), K. V. Rashmi (Carnegie Mellon University), Shivaram Venkataraman (University of Wisconsin-Madison)
TASO: Optimizing Deep Learning Computation with Automated Generation of Graph Substitutions Zhihao Jia (Stanford University), Oded Padon (Stanford University), James Thomas (Stanford University), Todd Warszawski (Stanford University), Matei Zaharia (Stanford University), Alex Aiken (Stanford University)
Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis Haichen Shen (Amazon Web Services), Lequn Chen (University of Washington), Yuchen Jin (University of Washington), Liangyu Zhao (University of Washington), Bingyu Kong (Shanghai Jiao Tong University), Matthai Philipose (Microsoft Research), Arvind Krishnamurthy (University of Washington), Ravi Sundaram (Northeastern University)
Gradient Compression Supercharged High-Performance Data Parallel DNN Training Youhui Bai (University of Science and Technology of China), Cheng Li (University of Science and Technology of China), Quan Zhou (University of Science and Technology of China), Jun Yi (University of Nevada at Reno), Ping Gong (University of Science and Technology of China), Feng Yan (University of Nevada at Reno), Ruichuan Chen (Nokia Bell Labs), Yinlong Xu (University of Science and Technology of China)
JANUS: Fast and Flexible Deep Learning via Symbolic Graph Execution of Imperative Programs. Eunji Jeong, Sungwoo Cho, Gyeong-In Yu, Joo Seong Jeong, Dong-Jin Shin, Byung-Gon Chun. NSDI 2019.
BLAS-on-flash: An Efficient Alternative for Large Scale ML Training and Inference? Suhas Jayaram Subramanya and Harsha Vardhan Simhadri, Microsoft Research India; Srajan Garg, IIT Bombay; Anil Kag and Venkatesh Balasubramanian, Microsoft Research India
Tiresias: A GPU Cluster Manager for Distributed Deep Learning Juncheng Gu, Mosharaf Chowdhury, and Kang G. Shin, University of Michigan, Ann Arbor; Yibo Zhu, Microsoft and Bytedance; Myeongjae Jeon, Microsoft and UNIST; Junjie Qian, Microsoft; Hongqiang Liu, Alibaba; Chuanxiong Guo, Bytedance
Themis: Fair and Efficient GPU Cluster Scheduling Kshiteej Mahajan, Arjun Balasubramanian, Arjun Singhvi, Shivaram Venkataraman, and Aditya Akella, University of Wisconsin-Madison; Amar Phanishayee, Microsoft Research; Shuchi Chawla, University of Wisconsin-Madison
Mistify: Automating DNN Model Porting for On-Device Inference at the Edge Peizhen Guo, Bo Hu, and Wenjun Hu, Yale University
Elastic Resource Sharing for Distributed Deep Learning Changho Hwang and Taehyun Kim, KAIST; Sunghyun Kim, MIT; Jinwoo Shin and KyoungSoo Park, KAIST
ATP: In-network Aggregation for Multi-tenant Learning ChonLam Lao, Tsinghua University; Yanfang Le and Kshiteej Mahajan, University of Wisconsin-Madison; Yixi Chen and Wenfei Wu, Tsinghua University; Aditya Akella and Michael Swift, University of Wisconsin-Madison
Scaling Distributed Machine Learning with In-Network Aggregation Amedeo Sapio, Marco Canini, and Chen-Yu Ho, KAUST; Jacob Nelson, Microsoft; Panos Kalnis, KAUST; Changhoon Kim, Barefoot Networks; Arvind Krishnamurthy, University of Washington; Masoud Moshref, Barefoot Networks; Dan Ports, Microsoft; Peter Richtarik, KAUST
Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, and Misha Smelyanskiy, Facebook; Murali Annavaram, Facebook and USC
MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters Qizhen Weng, Hong Kong University of Science and Technology and Alibaba Group; Wencong Xiao, Alibaba Group; Yinghao Yu, Alibaba Group and Hong Kong University of Science and Technology; Wei Wang, Hong Kong University of Science and Technology; Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, and Yu Ding, Alibaba Group
Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks Joshua Romero, NVIDIA, Inc.; Junqi Yin, Nouamane Laanait, Bing Xie, and M. Todd Young, Oak Ridge National Laboratory; Sean Treichler, NVIDIA, Inc.; Vitalii Starchenko and Albina Borisevich, Oak Ridge National Laboratory; Alex Sergeev, Carbon Robotics; Michael Matheson, Oak Ridge National Laboratory
Cocktail: A Multidimensional Optimization for Model Serving in the Cloud Jashwant Raj Gunasekaran, Cyan Subhra Mishra, Prashanth Thinakaran, Bikash Sharma, Mahmut Taylan Kandemir, and Chita R. Das, The Pennsylvania State University
Transparent GPU Sharing in Container Clouds for Deep Learning Workloads Bingyang Wu and Zili Zhang, Peking University; Zhihao Bai, Johns Hopkins University; Xuanzhe Liu and Xin Jin, Peking University
ARK: GPU-driven Code Execution for Distributed Deep Learning Changho Hwang, Microsoft Research; KyoungSoo Park, KAIST; Ran Shu, Xinyuan Qu, Peng Cheng, and Yongqiang Xiong, Microsoft Research
BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing Tianfeng Liu, Tsinghua University, Zhongguancun Laboratory, ByteDance; Yangrui Chen, The University of Hong Kong, ByteDance; Dan Li, Tsinghua University, Zhongguancun Laboratory; Chuan Wu, The University of Hong Kong; Yibo Zhu, Jun He, and Yanghua Peng, ByteDance; Hongzheng Chen, ByteDance, Cornell University; Hongzhi Chen and Chuanxiong Guo, ByteDance
Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training Jie You, Jae-Won Chung, and Mosharaf Chowdhury, University of Michigan
Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, and Yifan Qiao, UCLA; Zhihao Jia, CMU; Minjia Zhang, Microsoft Research; Ravi Netravali, Princeton University; Guoqing Harry Xu, UCLA
Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning Pengfei Zheng and Rui Pan, University of Wisconsin-Madison; Tarannum Khan, The University of Texas at Austin; Shivaram Venkataraman, University of Wisconsin-Madison; Aditya Akella, The University of Texas at Austin
TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs Weiyang Wang, Moein Khazraee, Zhizhen Zhong, and Manya Ghobadi, Massachusetts Institute of Technology; Zhihao Jia, Meta and CMU; Dheevatsa Mudigere and Ying Zhang, Meta; Anthony Kewitsch, Telescent
ModelKeeper: Accelerating DNN Training via Automated Training Warmup Fan Lai, Yinwei Dai, Harsha V. Madhyastha, and Mosharaf Chowdhury, University of Michigan
SHEPHERD: Serving DNNs in the Wild Hong Zhang, University of Waterloo; Yupeng Tang and Anurag Khandelwal, Yale University; Ion Stoica, UC Berkeley
Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE Kshiteej Mahajan, University of Wisconsin - Madison; Ching-Hsiang Chu and Srinivas Sridharan, Facebook; Aditya Akella, UT Austin
On Modular Learning of Distributed Systems for Predicting End-to-End Latency Chieh-Jan Mike Liang, Microsoft Research; Zilin Fang, Carnegie Mellon University; Yuqing Xie, Tsinghua University; Fan Yang, Microsoft Research; Zhao Lucis Li, University of Science and Technology of China; Li Lyna Zhang, Mao Yang, and Lidong Zhou, Microsoft Research
SelfTune: Tuning Cluster Managers Ajaykrishna Karthikeyan and Nagarajan Natarajan, Microsoft Research; Gagan Somashekar, Stony Brook University; Lei Zhao, Microsoft; Ranjita Bhagwan, Microsoft Research; Rodrigo Fonseca, Tatiana Racheva, and Yogesh Bansal, Microsoft
Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu (University of Hong Kong), and Chuanxiong Guo (Bytedance Inc.)
Dynamic Control Flow in Large-Scale Machine Learning Yuan Yu (Microsoft), Martin Abadi, Paul Barham, Eugene Brevdo, Mike Burrows, Andy Davis, Jeff Dean, Sanjay Ghemawat (Google), Tim Harley (DeepMind), Peter Hawkins, Michael Isard (Google), Manjunath Kudlur (Cerebras), Rajat Monga, Derek Murray, and Xiaoqiang Zheng (Google)
Improving the Expressiveness of Deep Learning Frameworks with Recursion Eunji Jeong, Joo Seong Jeong, Soojeong Kim, Gyeong-In Yu, and Byung-Gon Chun (Seoul National University)
Low Latency RNN Inference with Cellular Batching Pin Gao (Tsinghua University), Lingfan Yu (New York University), Yongwei Wu (Tsinghua University), and Jinyang Li (New York University)
Supporting Very Large Models using Dataflow Graph Partitioning Minjie Wang, Chien-chin Huang, and Jinyang Li (NYU)
GRNN: Low-Latency and Scalable RNN Inference on GPUs Connor Holmes and Daniel Mawhirter (Colorado School of Mines); Yuxiong He (Microsoft Business AI and Research); Feng Yan (University of Nevada, Reno); Bo Wu (Colorado School of Mines)
Automating Dependence-Aware Parallelization of Machine Learning Training on Distributed Shared Memory Jinliang Wei (Carnegie Mellon University); Garth Gibson (Vector Institute, Carnegie Mellon University, University of Toronto); Philip Gibbons (Carnegie Mellon University); Eric Xing (Petuum, Carnegie Mellon University)
Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks Soojeong Kim, Gyeong-In Yu, Hojin Park, Sungwoo Cho, Eunji Jeong, Hyeonmin Ha, Sanha Lee, Joo Seong Jeong, and Byung-Gon Chun (Seoul National University)
Borg: the Next Generation Muhammad Tirmazi (Harvard University), Adam Barker (Google and University of St Andrews), Nan Deng, Md Ehtesam Haque, Zhijing Gene Qin, Steven Hand (Google), Mor Harchol-Balter (Carnegie Mellon University), John Wilkes (Google)
AlloX: Compute Allocation in Hybrid Clusters Tan N. Le (SUNY Korea, Stony Brook University), Xiao Sun (Stony Brook University), Mosharaf Chowdhury (University of Michigan), Zhenhua Liu (Stony Brook University)
Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning Shubham Chaudhary, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Srinidhi Viswanatha (Microsoft Research India)
FlexGraph: A Flexible and Efficient Distributed Framework for GNN Training, Lei Wang (Alibaba Group), Qiang Yin (Alibaba Group), Chao Tian (Alibaba Group), Jianbang Yang (Shanghai Jiao Tong University), Rong Chen (Shanghai Jiao Tong University), Wenyuan Yu (Alibaba Group), Zihang Yao (Shanghai Jiao Tong University), Jingren Zhou (Alibaba Group)
DGCL: An Efficient Communication Library for Distributed GNN Training, Zhenkun Cai (The Chinese University of Hong Kong), Xiao Yan (Southern University of Science and Technology), Yidi Wu (The Chinese University of Hong Kong), Kaihao Ma (The Chinese University of Hong Kong), James Cheng (The Chinese University of Hong Kong), Fan Yu (Huawei Technologies Co. Ltd)
Seastar: Vertex-Centric Programming for Graph Neural Networks, Yidi Wu (The Chinese University of Hong Kong), Kaihao Ma (The Chinese University of Hong Kong), Zhenkun Cai (The Chinese University of Hong Kong), Tatiana Jin (The Chinese University of Hong Kong), Boyang Li (The Chinese University of Hong Kong), Chenguang Zheng (The Chinese University of Hong Kong), James Cheng (The Chinese University of Hong Kong), Fan Yu (Huawei Technologies Co. Ltd)
Accelerating Graph Sampling for Graph Machine Learning using GPUs, Abhinav Jangda (University of Massachusetts Amherst), Sandeep Polisetty (University of Massachusetts Amherst), Arjun Guha (Northeastern University), Marco Serafini (University of Massachusetts Amherst)
Rubberband: Cloud-based Hyperparameter Tuning, Richard Liaw (UC Berkeley), Ujval Misra (UC Berkeley), Lisa Dunlap (UC Berkeley), Joseph Gonzalez (UC Berkeley), Ion Stoica (UC Berkeley), Alexey Tumanov (Georgia Tech), Kirthevasan Kandasamy (UC Berkeley), Romil Bhardwaj (UC Berkeley)
Tahoe: Tree Structure-Aware High Performance Inference Engine for Decision Tree Ensemble on GPU, Zhen Xie (University of California, Merced), Wenqian Dong (University of California, Merced), Jiawen Liu (University of California, Merced), Hang Liu (Stevens Institute of Technology), Dong Li (University of California, Merced)
Fleche: An Efficient GPU Embedding Cache for Personalized Recommendations. Minhui Xie, Youyou Lu, Jiazhen Lin, Qing Wang, and Jian Gao (Tsinghua University), Kai Ren (Kuaishou Technology), Jiwu Shu (Tsinghua University)
GNNLab: A Factored System for Sample-based GNN Training over GPUs. Jianbang Yang (IPADS, Shanghai Jiao Tong University), Dahai Tang (Hunan University), Xiaoniu Song (IPADS, Shanghai Jiao Tong University, Shanghai AI Laboratory), Lei Wang (Alibaba Group), Qiang Yin (BASICS, Shanghai Jiao Tong University), Rong Chen (IPADS, Shanghai Jiao Tong University, Shanghai AI Laboratory), Wenyuan Yu and Jingren Zhou (Alibaba Group)
Out-Of-Order BackProp: An Effective Scheduling Technique for Deep Learning. Hyungjun Oh, Junyeol Lee, Hyeongju Kim, and Jiwon Seo (Hanyang University)
D3: A Dynamic Deadline-Driven Approach for Building Autonomous Vehicles. Ionel Gog, Sukrit Kalra, Peter Schafhalter, Joseph E. Gonzalez, Ion Stoica. (UC Berkeley)
SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters Hanyu Zhao (Peking University), Zhenhua Han (Microsoft Research), Zhi Yang (Peking University), Quanlu Zhang (Microsoft Research), Mingxia Li (USTC), Fan Yang (Microsoft Research), Qianxi Zhang (Microsoft Research), Binyang Li (Microsoft), Yuqing Yang (Microsoft Research), Lili Qiu (Microsoft Research), Lintao Zhang (BaseBit Technologies), Lidong Zhou (Microsoft Research)
MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks Roger Waleffe (University of Wisconsin-Madison), Jason Mohoney (University of Wisconsin-Madison), Theodoros Rekatsinas (ETH Zurich), Shivaram Venkataraman (University of Wisconsin-Madison)
Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies Zhuang Wang (Rice University), Haibin Lin (ByteDance Inc.), Yibo Zhu (ByteDance Inc.), T. S. Eugene Ng (Rice University)
Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access Jinwoo Jeong (Ajou University), Seungsu Baek (Ajou University), Jeongseob Ahn (Ajou University)
Tabi: An Efficient Multi-Level Inference System for Large Language Models Yiding Wang (Hong Kong University of Science and Technology), Kai Chen (Hong Kong University of Science and Technology), Haisheng Tan (University of Science and Technology of China), Kun Guo (Fuzhou University)
JOG: Joint Graph and Operator level Optimizations for Deep Learning Compilation Zhiying Xu (Nanjing University), Jiafan Xu (Nanjing University), Hongding Peng (Nanjing University), Wei Wang (Nanjing University), Xiaoliang Wang (Nanjing University), Haoran Wan (Nanjing University), Haipeng Dai (Nanjing University), Yixu Xu (Huawei Technologies), Hao Cheng (Huawei Technologies), Kun Wang (The Hong Kong Polytechnic University), Guihai Chen (Nanjing University)
Lyra: Elastic Scheduling for Deep Learning Clusters Jiamin Li (City University of Hong Kong), Hong Xu (The Chinese University of Hong Kong), Yibo Zhu (ByteDance Inc.), Zherui Liu (ByteDance Inc.), Chuanxiong Guo, Cong Wang (City University of Hong Kong)
Egeria: Efficient DNN Training with Knowledge-Guided Layer Freezing Yiding Wang (Hong Kong University of Science and Technology), Decang Sun (Hong Kong University of Science and Technology), Kai Chen (Hong Kong University of Science and Technology), Fan Lai (University of Michigan), Mosharaf Chowdhury (University of Michigan)
Pocket: ML Serving from the Edge Misun Park (Georgia Institute of Technology), Ketan Bhardwaj (Georgia Institute of Technology), Ada Gavrilovska (Georgia Institute of Technology)
Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs Yanhao Chen and Ari B. Hayes, Rutgers University; Chi Zhang, University of Pittsburgh; Timothy Salmon and Eddy Z. Zhang, Rutgers University
Litz: Elastic Framework for High-Performance Distributed Machine Learning Aurick Qiao, Petuum, Inc. and Carnegie Mellon University; Abutalib Aghayev, Carnegie Mellon University; Weiren Yu, Petuum, Inc. and Beihang University; Haoyang Chen and Qirong Ho, Petuum, Inc.; Garth A. Gibson, Carnegie Mellon University and Vector Institute; Eric P. Xing, Petuum, Inc. and Carnegie Mellon University
Cavs: An Efficient Runtime System for Dynamic Neural Networks Shizhen Xu, Carnegie Mellon University, Tsinghua University; Hao Zhang, Graham Neubig, and Wei Dai, Carnegie Mellon University, Petuum Inc.; Jin Kyu Kim, Carnegie Mellon University; Zhijie Deng, Tsinghua University; Qirong Ho, Petuum Inc.; Guangwen Yang, Tsinghua University; Eric P. Xing, Petuum Inc.
DeepCPU: Serving RNN-based Deep Learning Models 10x Faster Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, and Yuxiong He, Microsoft AI and Research
SIMD-X: Programming and Processing of Graph Algorithms on GPUs Hang Liu, University of Massachusetts Lowell; H. Howie Huang, George Washington University
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs Lingxiao Ma and Zhi Yang, Peking University; Youshan Miao, Jilong Xue, Ming Wu, and Lidong Zhou, Microsoft Research; Yafei Dai, Peking University
Cognitive SSD: A Deep Learning Engine for In-Storage Data Retrieval Shengwen Liang and Ying Wang, State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences; Youyou Lu and Zhe Yang, Tsinghua University; Huawei Li and Xiaowei Li, State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing; University of Chinese Academy of Sciences
Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads Myeongjae Jeon, UNIST and Microsoft Research; Shivaram Venkataraman, University of Wisconsin and Microsoft Research; Amar Phanishayee and Junjie Qian, Microsoft Research; Wencong Xiao, Beihang University and Microsoft Research; Fan Yang, Microsoft Research
Optimizing CNN Model Inference on CPUs Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang, Amazon
MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving Chengliang Zhang, Minchen Yu, and Wei Wang, Hong Kong University of Science and Technology; Feng Yan, University of Nevada, Reno
HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, and Seungmin Lee, UNIST; Jaesik Choi, KAIST; Sam H. Noh and Young-ri Choi, UNIST
AutoSys: The Design and Operation of Learning-Augmented Systems Chieh-Jan Mike Liang, Hui Xue, Mao Yang, and Lidong Zhou, Microsoft Research; Lifei Zhu, Peking University and Microsoft Research; Zhao Lucis Li and Zibo Wang, University of Science and Technology of China and Microsoft Research; Qi Chen and Quanlu Zhang, Microsoft Research; Chuanjie Liu, Microsoft Bing Platform; Wenjun Dai, Microsoft Bing Ads
Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training Hongyu Zhu, University of Toronto & Vector Institute; Amar Phanishayee, Microsoft Research; Gennady Pekhimenko, University of Toronto & Vector Institute
NeuOS: A Latency-Predictable Multi-Dimensional Optimization Framework for DNN-driven Autonomous Systems Soroush Bateni and Cong Liu, University of Texas at Dallas
Scaph: Scalable GPU-Accelerated Graph Processing with Value-Driven Differential Scheduling Long Zheng, Xianliang Li, Yaohui Zheng, Yu Huang, Xiaofei Liao, and Hai Jin, Huazhong University of Science and Technology; Jingling Xue, UNSW Sydney; Zhiyuan Shao and Qiang-Sheng Hua, Huazhong University of Science and Technology
Octo: INT8 Training with Loss-aware Compensation and Backward Quantization for Tiny On-device Learning Qihua Zhou and Song Guo, Hong Kong Polytechnic University; Zhihao Qu, Hohai University; Jingcai Guo, Zhenda Xu, Jiewei Zhang, Tao Guo, and Boyuan Luo, Hong Kong Polytechnic University; Jingren Zhou, Alibaba Group
Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism Saar Eliad, Ido Hakimi, and Alon De Jagger, Department of Computer Science, Technion - Israel Institute of Technology; Mark Silberstein, Department of Computer Science and Department of Electrical Engineering, Technion - Israel Institute of Technology; Assaf Schuster, Department of Computer Science, Technion - Israel Institute of Technology
INFaaS: Automated Model-less Inference Serving Francisco Romero, Qian Li, Neeraja J. Yadwadkar, and Christos Kozyrakis, Stanford University
Jump-Starting Multivariate Time Series Anomaly Detection for Online Service Systems Minghua Ma, Tsinghua University, BNRist; Shenglin Zhang, Nankai University; Junjie Chen, Tianjin University; Jim Xu, Georgia Tech; Haozhe Li and Yongliang Lin, Nankai University; Xiaohui Nie, Tsinghua University, BNRist; Bo Zhou and Yong Wang, CNCERT/CC; Dan Pei, Tsinghua University, BNRist
Palleon: A Runtime System for Efficient Video Processing toward Dynamic Class Skew Boyuan Feng, Yuke Wang, Gushu Li, Yuan Xie, and Yufei Ding, University of California, Santa Barbara
Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training Geoffrey X. Yu, University of Toronto/Vector Institute; Yubo Gao, University of Toronto; Pavel Golikov and Gennady Pekhimenko, University of Toronto/Vector Institute
Zico: Efficient GPU Memory Sharing for Concurrent DNN Training Gangmuk Lim, UNIST; Jeongseob Ahn, Ajou University; Wencong Xiao, Alibaba Group; Youngjin Kwon, KAIST; Myeongjae Jeon, UNIST
Refurbish Your Training Data: Reusing Partially Augmented Samples for Faster Deep Neural Network Training Gyewon Lee, Seoul National University and FriendliAI; Irene Lee, Georgia Institute of Technology; Hyeonmin Ha, Kyunggeun Lee, and Hwarim Hyun, Seoul National University; Ahnjae Shin and Byung-Gon Chun, Seoul National University and FriendliAI
ZeRO-Offload: Democratizing Billion-Scale Model Training Jie Ren, UC Merced; Samyam Rajbhandari, Reza Yazdani Aminabadi, and Olatunji Ruwase, Microsoft; Shuangyan Yang, UC Merced; Minjia Zhang, Microsoft; Dong Li, UC Merced; Yuxiong He, Microsoft
Faith: An Efficient Framework for Transformer Verification on GPUs Boyuan Feng, Tianqi Tang, Yuke Wang, Zhaodong Chen, Zheng Wang, Shu Yang, Yuan Xie, Yufei Ding, University of California, Santa Barbara
DVABatch: Diversity-aware Multi-Entry Multi-Exit Batching for Efficient Processing of DNN Services on GPUs Weihao Cui, Han Zhao, Quan Chen, Hao Wei, and Zirui Li, Shanghai Jiao Tong University; Deze Zeng, China University of Geosciences; Chao Li and Minyi Guo, Shanghai Jiao Tong University
Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, and Jaehyuk Huh, KAIST
PilotFish: Harvesting Free Cycles of Cloud Gaming with Deep Learning Training Wei Zhang and Binghao Chen, Shanghai Jiao Tong University; Zhenhua Han, Microsoft Research; Quan Chen, Shanghai Jiao Tong University; Peng Cheng, Fan Yang, Ran Shu, and Yuqing Yang, Microsoft Research; Minyi Guo, Shanghai Jiao Tong University
Tetris: Memory-efficient Serverless Inference through Tensor Sharing Jie Li, Laiping Zhao, and Yanan Yang, Tianjin University; Kunlin Zhan, 58.com; Keqiu Li, Tianjin University
PetS: A Unified Framework for Parameter-Efficient Transformers Serving Zhe Zhou, Peking University; Xuechao Wei, Peking University, Alibaba Group; Jiejing Zhang, Alibaba Group; Guangyu Sun, Peking University
Campo: Cost-Aware Performance Optimization for Mixed-Precision Neural Network Training Xin He, CSEE, Hunan University & Xidian University; Jianhua Sun and Hao Chen, CSEE, Hunan University; Dong Li, University of California, Merced
Primo: Practical Learning-Augmented Systems with Interpretable Models Qinghao Hu, Nanyang Technological University; Harsha Nori, Microsoft; Peng Sun, SenseTime; Yonggang Wen and Tianwei Zhang, Nanyang Technological University
Cachew: Machine Learning Input Data Processing as a Service Dan Graur, Damien Aymon, Dan Kluser, and Tanguy Albrici, ETH Zurich; Chandramohan A. Thekkath, Google; Ana Klimovic, ETH Zurich
SOTER: Guarding Black-box Inference for General Neural Networks at the Edge Tianxiang Shen, Ji Qi, Jianyu Jiang, Xian Wang, Siyuan Wen, Xusheng Chen, and Shixiong Zhao, The University of Hong Kong; Sen Wang and Li Chen, Huawei Technologies; Xiapu Luo, The Hong Kong Polytechnic University; Fengwei Zhang, Southern University of Science and Technology (SUSTech); Heming Cui, The University of Hong Kong
tf.data: a machine learning data processing framework Derek G. Murray, Jiří Šimša, Ana Klimovic, Ihor Indyk
Analyzing and Mitigating Data Stalls in DNN Training Jayashree Mohan, Amar Phanishayee, Ashish Raniwala, Vijay Chidambaram
Hippo: Sharing Computations in Hyper-Parameter Optimization. Ahnjae Shin, Joo Seong Jeong, Do Yoon Kim, Soyoung Jung, Byung-Gon Chun
WindTunnel: Towards Differentiable ML Pipelines Beyond a Single Model. Gyeong-In Yu, Saeed Amizadeh, Sehoon Kim, Artidoro Pagnoni, Ce Zhang, Byung-Gon Chun, Markus Weimer, Matteo Interlandi
HET-GMP: A Graph-based System Approach to Scaling Large Embedding Model Training Xupeng Miao (Peking University); Yining Shi (Peking University); Hailin Zhang (Peking University); Xin Zhang (Peking University); Xiaonan Nie (Peking University); Zhi Yang (Peking University); Bin Cui (Peking University)
Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines Alexander Isenko (Technical University of Munich); Ruben Mayer (Technical University of Munich); Jeffery Jedele (Technical University of Munich); Hans-Arno Jacobsen (University of Toronto)
Complaint-Driven Training Data Debugging at Interactive Speeds Lampros Flokas (Columbia University); Weiyuan Wu (Simon Fraser University); Yejia Liu (Simon Fraser University); Jiannan Wang (Simon Fraser University); Nakul Verma (Columbia University); Eugene Wu (Columbia University)
FuseME: Distributed Matrix Computation Engine based on Cuboid-based Fused Operator and Plan Generation Donghyoung Han (KAIST); Jongwuk Lee (Sungkyunkwan University); Min-Soo Kim (KAIST)
Sommelier: Curating DNN Models for the Masses Peizhen Guo (Yale University); Bo Hu (Yale University); Wenjun Hu (Yale University)
BlindFL: Vertical Federated Machine Learning without Peeking into Your Data Fangcheng Fu (Peking University); Huanran Xue (Tencent Inc.); Yong Cheng (Tencent Inc.); Yangyu Tao (Tencent Inc.); Bin Cui (Peking University)
NeutronStar: Distributed GNN Training with Hybrid Dependency Management Qiange Wang (Northeastern University); Yanfeng Zhang (Northeastern University); Hao Wang (The Ohio State University); Chaoyi Chen (Northeastern University); Xiaodong Zhang (The Ohio State University); Ge Yu (Northeastern University)
End-to-end Optimization of Machine Learning Prediction Queries Kwanghyun Park (Microsoft); Karla Saur (Microsoft); Dalitso Banda (Microsoft); Rathijit Sen (Microsoft); Matteo Interlandi (Microsoft); Konstantinos Karanasos (Microsoft)
Hidden Technical Debt in Machine Learning Systems.
D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, Dan Dennison.
Mesh-TensorFlow: Deep Learning for Supercomputers Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, Blake Hechtman
PyTorch: An Imperative Style, High-Performance Deep Learning Library Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen
Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning Woosuk Kwon, Gyeong-In Yu, Eunji Jeong, Byung-Gon Chun
PyGlove: Symbolic Programming for Automated Machine Learning Daiyi Peng, Xuanyi Dong, Esteban Real, Mingxing Tan, Yifeng Lu, Gabriel Bender, Hanxiao Liu, Adam Kraft, Chen Liang, Quoc Le
Terra: Imperative-Symbolic Co-Execution of Imperative Deep Learning Programs Taebum Kim, Eunji Jeong, Geon-Woo Kim, Yunmo Koo, Sehoon Kim, Gyeong-In Yu, Byung-Gon Chun
TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica
GShard: Scaling giant models with conditional computation and automatic sharding.
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen
Swizzle Inventor: Data Movement Synthesis for GPU Kernels Phitchaya Mangpo Phothilimthana (University of California, Berkeley); Archibald Samuel Elliott (University of Washington); Abhinav Jangda (University of Massachusetts Amherst); Bastian Hagedorn (University of Münster); Henrik Barthels (AICES, RWTH Aachen University); Rastislav Bodik (University of Washington); Vinod Grover (NVIDIA)
DeepSigns: An End-to-End Watermarking Framework for Protecting the Ownership of Deep Neural Networks Bita Darvish Rouhani, Huili Chen, Farinaz Koushanfar (UC San Diego)
DiGraph: An Efficient Path-based Iterative Directed Graph Processing System on Multiple GPUs Yu Zhang, Xiaofei Liao, Hai Jin (Huazhong University of Science and Technology); Bingsheng He (National University of Singapore); Haikun Liu, Lin Gu (Huazhong University of Science and Technology)
TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis (Stanford University)
Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization HT Kung, Bradley McDanel, Sai Qian Zhang (Harvard University)
Split-CNN: Splitting Window-based Operations in Convolutional Neural Networks for Memory System Optimization Tian Jin (IBM T.J. Watson Research Center); Seokin Hong (Kyungpook National University)
HOP: Heterogeneity-Aware Decentralized Training Qinyi Luo (University of Southern California); Jinkun Lin (Tsinghua University); Youwei Zhuo, Xuehai Qian (University of Southern California)
Astra: Exploiting Predictability to Optimize Deep Learning
Muthian Sivathanu (Microsoft Research India); Tapan Chugh (Microsoft Research India); Sanjay Srivallabh (Microsoft Research India); Lidong Zhou (Microsoft Research Asia)
ADMM-NN: An Algorithm-Hardware Co-Design Framework of DNNs Using Alternating Direction Methods of Multipliers Ao Ren (Northeastern University); Jiayu Li, Tianyun Zhang, Shaokai Ye (Syracuse University); Wenyao Xu (SUNY Buffalo); Xuehai Qian (University of Southern California); Xue Lin, Yanzhi Wang (Northeastern University)
Interstellar: Using Halide’s Scheduling Language to Analyze DNN Accelerators Xuan Yang (Stanford University); Mingyu Gao (Tsinghua University); Qiaoyi Liu (Stanford University); Jeff Setter (Stanford University); Jing Pu (Stanford University); Ankita Nayak (Stanford University); Steven Bell (Stanford University); Kaidi Cao (Stanford University); Heonjae Ha (Stanford University); Priyanka Raina (Stanford University); Christos Kozyrakis (Stanford University, Google); Mark Horowitz (Stanford University)
DeepSniffer: A DNN Model Extraction Framework Based on Learning Architectural Hints Xing Hu (University of California, Santa Barbara); Ling Liang (University of California, Santa Barbara); Shuangchen Li (University of California, Santa Barbara); Lei Deng (University of California, Santa Barbara & Tsinghua University); Pengfei Zuo (University of California, Santa Barbara & Huazhong University of Science and Technology); Yu Ju (University of California, Santa Barbara & Tsinghua University); Xinfeng Xie (University of California, Santa Barbara); Yufei Ding (University of California, Santa Barbara); Chang Liu (Citadel Securities); Timothy Sherwood (University of California, Santa Barbara); Yuan Xie (University of California, Santa Barbara)
Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training
Qinyi Luo (University of Southern California); Jiaao He (Tsinghua University); Youwei Zhuo (University of Southern California); Xuehai Qian (University of Southern California)
FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System Size Zheng (Peking University); Yun Liang (Peking University); Shuo Wang (Peking University); Renze Chen (Peking University); Kaiwen Sheng (Peking University)
AutoTM: Automatic Tensor Movement in Heterogeneous Memory Systems using Integer Linear Programming Mark Hildebrand (University of California, Davis); Jawad Khan (Intel Corporation); Sanjeev Trika (Intel Corporation); Jason Lowe-Power (University of California, Davis); Venkatesh Akella (University of California, Davis)
Capuchin: Tensor-based GPU Memory Management for Deep Learning Xuan Peng (Huazhong University of Science and Technology); Xuanhua Shi (Huazhong University of Science and Technology); Hulin Dai (Huazhong University of Science and Technology); Hai Jin (Huazhong University of Science and Technology); Weiliang Ma (Huazhong University of Science and Technology); Qian Xiong (Huazhong University of Science and Technology); Fan Yang (Microsoft Research Asia); Xuehai Qian (University of Southern California)
SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping Chien-Chin Huang (New York University); Gu Jin (New York University); Jinyang Li (New York University)
Analytical Characterization and Design Space Exploration for Optimization of CNNs Rui Li, Yufan Xu (University of Utah); Aravind Sukumaran-Rajam (Washington State University); Atanas Rountev (Ohio State University); P. Sadayappan (University of Utah)
A Full-stack Search Technique for Domain Optimized Deep Learning Accelerators Dan Zhang (Google Brain), Safeen Huda (Google), Ebrahim Songhori (Google Brain), Kartik Prabhu (Stanford University), Quoc Le (Google Brain), Anna Goldie (Google Brain), Azalia Mirhoseini (Google Brain)
ValueExpert: Exploring Value Patterns in GPU-accelerated Applications Keren Zhou (Rice University), Yueming Hao (North Carolina State University), John Mellor-Crummey (Rice University), Xiaozhu Meng (Rice University), Xu Liu (North Carolina State University and Oak Ridge National Laboratory)
RecShard: Statistical Feature-Based Memory Optimization for Industry-Scale Neural Recommendation Geet Sethi (Stanford University and Meta), Bilge Acun (Meta), Niket Agarwal (Meta), Christos Kozyrakis (Stanford University), Caroline Trippel (Stanford University), Carole-Jean Wu (Meta)
AStitch: Enabling A New Multi-Dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures Zhen Zheng (Alibaba Group), Xuanda Yang (Alibaba Group), Pengzhan Zhao (Alibaba Group), Guoping Long (Alibaba Group), Kai Zhu (Alibaba Group), Feiwen Zhu (Alibaba Group), Wenyi Zhao (Alibaba Group), Xiaoyong Liu (Alibaba Group), Jun Yang (Alibaba), Jidong Zhai (Tsinghua University), Shuaiwen Leon Song (University of Sydney & University of Washington), Wei Lin (Alibaba Group)
NASPipe: High Performance and Reproducible Pipeline Parallel Supernet Training via Causal Synchronous Parallelism Shixiong Zhao (University of Hong Kong), Fanxin Li (University of Hong Kong), Xusheng Chen (University of Hong Kong), Tianxiang Shen (University of Hong Kong), Li Chen (Huawei Technologies), Sen Wang (Huawei Technologies), Nicholas Zhang (Huawei Technologies), Cheng Li (University of Science and Technology of China), Heming Cui (University of Hong Kong)
VELTAIR: Towards High-Performance Multi-Tenant Deep Learning Services via Adaptive Compilation and Scheduling Zihan Liu (Shanghai Jiao Tong University), Jingwen Leng (Shanghai Jiao Tong University), Zhihui Zhang (Shanghai Jiao Tong University), Quan Chen (Shanghai Jiao Tong University), Chao Li (Shanghai Jiao Tong University), Minyi Guo (Shanghai Jiao Tong University)
Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads Abhinav Jangda (University of Massachusetts at Amherst), Jun Huang (Ohio State University), Guodong Liu (Chinese Academy of Sciences), Amir Hossein Nodehi Sabet (University of California at Riverside), Saeed Maleki (Microsoft Research), Youshan Miao (Microsoft Research), Madanlal Musuvathi (Microsoft Research), Todd Mytkowicz (Microsoft Research), Olli Saarikivi (Microsoft Research)
Astraea: Towards QoS-Aware and Resource-Efficient Multi-stage GPU Services Wei Zhang (Shanghai Jiao Tong University), Quan Chen (Shanghai Jiao Tong University), Kaihua Fu (Shanghai Jiao Tong University), Ningxin Zheng (Microsoft Research), Zhiyi Huang (University of Otago), Jingwen Leng (Shanghai Jiao Tong University), Minyi Guo (Shanghai Jiao Tong University)
SOL: Safe On-Node Learning in Cloud Platforms Yawen Wang (Stanford University), Daniel Crankshaw (Microsoft Research), Neeraja J. Yadwadkar (University of Texas at Austin), Daniel Berger (Microsoft Research), Christos Kozyrakis (Stanford University), Ricardo Bianchini (Microsoft Research)
KAISA: An Adaptive Second-Order Optimizer Framework for Deep Neural Networks J. Gregory Pauloski, Qi Huang, Lei Huang, Shivaram Venkataraman, Kyle Chard, Ian Foster, Zhao Zhang
Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning Workloads Evangelos Georganas, Dhiraj Kalamkar, Sasikanth Avancha, Menachem Adelman, Cristina Anderson, Alexander Breuer, Jeremy Bruestle, Narendra Chaudhary, Abhisek Kundu, Denise Kutnick, Frank Laub, Vasimuddin Md, Sanchit Misra, Ramanarayan Mohanty, Hans Pabst, Barukh Ziv, Alexander Heinecke
Enable Simultaneous DNN Services Based on Deterministic Operator Overlap and Precise Latency Prediction Weihao Cui, Han Zhao, Quan Chen, Ningxin Zheng, Jingwen Leng, Jieru Zhao, Zhuo Song, Tao Ma, Yong Yang, Chao Li, Minyi Guo
ET: Re-Thinking Self-Attention for Transformer Models on GPUs Shiyang Chen, Shaoyi Huang, Santosh Pandey, Bingbing Li, Guang R. Gao, Long Zheng, Caiwen Ding, Hang Liu
Parallel Construction of Module Networks Ankit Srivastava, Sriram Chockalingam, Maneesha Aluru, Srinivas Aluru
Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines Shigang Li, Torsten Hoefler
APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores Boyuan Feng, Yuke Wang, Tong Geng, Ang Li, Yufei Ding
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He
FedAT: A High-Performance and Communication-Efficient Federated Learning System with Asynchronous Tiers Zheng Chai, Yujing Chen, Ali Anwar, Liang Zhao, Yue Cheng, Huzefa Rangwala
DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks Vasimuddin Md, Sanchit Misra, Guixiang Ma, Ramanarayan Mohanty, Evangelos Georganas, Alexander Heinecke, Dhiraj Kalamkar, Nesreen K. Ahmed, Sasikanth Avancha
Efficient Scaling of Dynamic Graph Neural Networks Venkatesan T. Chakaravarthy, Shivmaran S. Pandian, Saurabh Raje, Yogish Sabharwal, Toyotaro Suzumura, Shashanka Ubaru
Efficient Tensor Core-Based GPU Kernels for Structured Sparsity Under Reduced Precision Zhaodong Chen, Zheng Qu, Liu Liu, Yufei Ding, Yuan Xie
MAPA: Multi-Accelerator Pattern Allocation Policy for Multi-Tenant GPU Servers Kiran Ranganath, Joshua D. Suetterlein, Joseph Manzano, Shuaiwen Leon Song, Daniel Wong
Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters Zhengda Bian, Shenggui Li, Wei Wang, Yang You
Efficient Quantized Sparse Matrix Operations on Tensor Cores Shigang Li, Kazuki Osawa, Torsten Hoefler
LightSeq2: Accelerated Training for Transformer-Based Models on GPUs Xiaohui Wang, Yang Wei, Ying Xiong, Guyue Huang, Xian Qian, Yufei Ding, Mingxuan Wang, Lei Li
CoGNN: Efficient Scheduling for Concurrent GNN Training on GPUs Qingxiao Sun, Yi Liu, Hailong Yang, Ruizhe Zhang, Ming Dun, Mingzhen Li, Xiaoyan Liu, Wencong Xiao, Yong Li, Zhongzhi Luan, Depei Qian
DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, Yuxiong He
VSGM: View-Based GPU-Accelerated Subgraph Matching on Large Graphs Guanxian Jiang, Qihui Zhou, Tatiana Jin, Boyang Li, Yunjian Zhao, Yichao Li, James Cheng
STMatch: Accelerating Graph Pattern Matching on GPU with Stack-Based Loop Optimizations Yihua Wei, Peng Jiang
WholeGraph: A Fast Graph Neural Network Training Framework with Multi-GPU Distributed Shared Memory Architecture Dongxu Yang,Junhong Liu,Jiaxing Qi,Junjie Lai
SpDISTAL: Compiling Distributed Sparse Tensor Computations Rohan Yadav,Alex Aiken,Fredrik Kjolstad
EL-Rec: Efficient Large-Scale Recommendation Model Training via Tensor-Train Embedding Table Zheng Wang,Yuke Wang,Boyuan Feng,Dheevatsa Mudigere,Bharath Muthiah,Yufei Ding
STRONGHOLD: Fast and Affordable Billion-Scale Deep Learning Model Training Xiaoyang Sun,Wei Wang,Shenghao Qiu,Renyu Yang,Songfang Huang,Jie Xu,Zheng Wang
HGL: Accelerating Heterogeneous GNN Training with Holistic Representation and Optimization Yuntao Gui, Yidi Wu, Han Yang, Tatiana Jin, Boyang Li, Qihui Zhou, James Cheng, Fan Yu
Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective Kim M. Hazelwood, Sarah Bird, David M. Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, James Law, Kevin Lee, Jason Lu, Pieter Noordhuis, Misha Smelyanskiy, Liang Xiong, Xiaodong Wang
The Architectural Implications of Facebook’s DNN-based Personalized Recommendation Udit Gupta, Carole-Jean Wu, Xiaodong Wang, Maxim Naumov, Brandon Reagen, David Brooks, Bradford Cottel, Kim Hazelwood, Mark Hempstead, Bill Jia, Hsien-Hsin S. Lee, Andrey Malevich, Dheevatsa Mudigere, Mikhail Smelyanskiy, Liang Xiong, Xuan Zhang
Compiling Machine Learning Programs via High-Level Tracing Roy Frostig, Matthew James Johnson, Chris Leary
Beyond Data and Model Parallelism for Deep Neural Networks Zhihao Jia, Matei Zaharia, Alex Aiken
MLPerf Training Benchmark Peter Mattson, Christine Cheng, Gregory Diamos, Cody Coleman, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debo Dutta, Udit Gupta, Kim Hazelwood, Andy Hock, Xinyuan Huang, Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Deepak Narayanan, Tayo Oguntebi, Gennady Pekhimenko, Lillian Pentecost, Vijay Janapa Reddi, Taylor Robie, Tom St John, Carole-Jean Wu, Lingjie Xu, Cliff Young, Matei Zaharia
Exploring the Limits of Concurrency in ML Training on Google TPUs Sameer Kumar, James Bradbury, Cliff Young, Yu Emma Wang, Anselm Levskaya, Blake Hechtman, Dehao Chen, HyoukJoong Lee, Mehmet Deveci, Naveen Kumar, Pankaj Kanwar, Shibo Wang, Skye Wanderman-Milne, Steve Lacy, Tao Wang, Tayo Oguntebi, Yazhou Zu, Yuanzhong Xu, Andy Swing
Pathways: Asynchronous Distributed Dataflow for ML Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Daniel Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, Brennan Saeta, Parker Schuh, Ryan Sepassi, Laurent Shafey, Chandu Thekkath, Yonghui Wu
DietCode: Automatic Optimization for Dynamic Tensor Programs Bojian Zheng, Ziheng Jiang, Cody Hao Yu, Haichen Shen, Joshua Fromm, Yizhi Liu, Yida Wang, Luis Ceze, Tianqi Chen, Gennady Pekhimenko
Synthesizing Optimal Collective Algorithms Zixian Cai, Australian National University; Zhengyang Liu, University of Utah; Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, and Olli Saarikivi, Microsoft Research, Redmond
Understanding and Bridging the Gaps in Current GNN Performance Optimizations Kezhao Huang and Jidong Zhai, Tsinghua University; Zhen Zheng, Alibaba Group; Youngmin Yi, University of Seoul; Xipeng Shen, North Carolina State University
I/O Lower Bounds for Auto-tuning of Convolutions in CNNs Xiaoyang Zhang, Junmin Xiao, and Guangming Tan, Institute of Computing Technology, Chinese Academy of Sciences
TurboTransformers: An Efficient GPU Serving System For Transformer Models Jiarui Fang, Tencent; Yang Yu; Chengduo Zhao and Jie Zhou, Tencent
DAPPLE: A Pipelined Data Parallel Approach for Training Large Models Shiqing Fan, Yi Rong, Chen Meng, ZongYan Cao, Siyu Wang, and Zhen Zheng, Alibaba Group; Chuan Wu, The University of Hong Kong; Guoping Long, Jun Yang, LiXue Xia, Lansong Diao, Xiaoyong Liu, and Wei Lin, Alibaba Group
CASE: A Compiler-Assisted SchEduling Framework for Multi-GPU Systems Chao Chen, Amazon Web Services; Chris Porter and Santosh Pande, Georgia Institute of Technology
TileSpGEMM: A Tiled Algorithm for Parallel Sparse General Matrix-Matrix Multiplication on GPUs Yuyao Niu, Zhengyang Lu, Haonan Ji, Shuhui Song, Zhou Jin, and Weifeng Liu, China University of Petroleum-Beijing
QGTC: Accelerating Quantized Graph Neural Networks via GPU Tensor Core Yuke Wang, Boyuan Feng, and Yufei Ding, University of California, Santa Barbara
FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li, Tsinghua University
Near-Optimal Sparse Allreduce for Distributed Deep Learning Shigang Li and Torsten Hoefler, ETH Zurich
BAGUALU: Targeting Brain Scale Pretrained Models with over 37 Million Cores Zixuan Ma and Jiaao He, Tsinghua University; Jiezhong Qiu, Tsinghua University and Beijing Academy of Artificial Intelligence; Huanqi Cao, Yuanwei Wang, Zhenbo Sun, Liyan Zheng, Haojie Wang, and Shizhi Tang, Tsinghua University; Tianyu Zheng, Zhejiang Lab; Junyang Lin, DAMO Academy, Alibaba Group; Guanyu Feng, Tsinghua University; Zeqiang Huang and Jie Gao, Zhejiang Lab; Aohan Zeng, Tsinghua University and Beijing Academy of Artificial Intelligence; Jianwei Zhang, DAMO Academy, Alibaba Group; Runxin Zhong and Tianhui Shi, Tsinghua University; Sha Liu, Zhejiang Lab; Weimin Zheng, Tsinghua University; Jie Tang, Tsinghua University and Beijing Academy of Artificial Intelligence; Hongxia Yang, DAMO Academy, Alibaba Group; Xin Liu, Zhejiang Lab; Jidong Zhai and Wenguang Chen, Tsinghua University
WISE: Predicting the Performance of Sparse Matrix Vector Multiplication with Machine Learning Serif Yesil and Azin Heidarshenas, University of Illinois Urbana-Champaign; Adam Morrison, Tel Aviv University; Josep Torrellas, University of Illinois Urbana-Champaign
TGOpt: Redundancy-Aware Optimizations for Temporal Graph Attention Networks Yufeng Wang and Charith Mendis, University of Illinois at Urbana-Champaign
Dynamic N:M Fine-grained Structured Sparse Attention Mechanism Zhaodong Chen, Zheng Qu, and Yuying Quan, University of California, Santa Barbara; Liu Liu; Yufei Ding and Yuan Xie, University of California, Santa Barbara
Elastic Averaging for Efficient Pipelined DNN Training Zihao Chen, Chen Xu, Weining Qian, and Aoying Zhou, East China Normal University
DSP: Efficient GNN Training with Multiple GPUs Zhenkun Cai and Qihui Zhou, The Chinese University of Hong Kong; Xiao Yan, Southern University of Science and Technology; Da Zheng and Xiang Song, Amazon Web Services; Chenguang Zheng and James Cheng, The Chinese University of Hong Kong; George Karypis, Amazon Web Services
PiPAD: Pipelined and Parallel Dynamic GNN Training on GPUs Chunyang Wang, Desen Sun, and Yuebin Bai, Beihang University
mGEMM: Low-latency Convolution with Minimal Memory Overhead Optimized for Mobile Devices Jongseok Park, Kyungmin Bin, Kyunghan Lee (Seoul National University)
Band: Coordinated Multi-DNN Inference on Heterogeneous Mobile Processors Joo Seong Jeong, Jingyu Lee, Donghyun Kim, Changmin Jeon, Changjin Jeong, Youngki Lee (Seoul National University); Byung-Gon Chun (Seoul National University, FriendliAI)
CoDL: Efficient CPU-GPU Co-execution for Deep Learning Inference on Mobile Devices Fucheng Jia, Deyu Zhang (Central South University); Ting Cao, Shiqi Jiang (Microsoft Research); Yunxin Liu, Ju Ren, Yaoxue Zhang (Tsinghua University)
FedBalancer: Data and Pace Control for Efficient Federated Learning on Heterogeneous Clients Jaemin Shin (School of Computing, KAIST); Yuanchun Li, Yunxin Liu (Institute for AI Industry Research (AIR), Tsinghua University); Sung-Ju Lee (School of Electrical Engineering, KAIST)
Memory-efficient DNN Training on Mobile Devices In Gim, JeongGil Ko (Yonsei University)
Melon: Breaking the Memory Wall for Resource-Efficient On-Device Machine Learning Qipeng Wang (Peking University); Mengwei Xu (Beijing University of Posts and Telecommunications); Chao Jin, Xinran Dong (Peking University); Jinliang Yuan (Beijing University of Posts and Telecommunications); Xin Jin, Gang Huang (Peking University); Yunxin Liu (Institute for AI Industry Research (AIR), Tsinghua University); Xuanzhe Liu (Peking University)
MLPerf Inference Benchmark Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, Ramesh Chukka, Cody Coleman, Sam Davis, Pan Deng, Greg Diamos, Jared Duke, Dave Fick, J. Scott Gardner, Itay Hubara, Sachin Idgunji, Thomas B. Jablin, Jeff Jiao, Tom St. John, Pankaj Kanwar, David Lee, Jeffery Liao, Anton Lokhmotov, Francisco Massa, Peng Meng, Paulius Micikevicius, Colin Osborne, Gennady Pekhimenko, Arun Tejusve Raghunath Rajan, Dilip Sequeira, Ashish Sirasao, Fei Sun, Hanlin Tang, Michael Thomson, Frank Wei, Ephrem Wu, Lingjie Xu, Koichi Yamada, Bing Yu, George Yuan, Aaron Zhong, Peizhao Zhang, Yuchen Zhou
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro