λDNN: Achieving Predictable Distributed DNN Training with Serverless Architectures

λDNN is a cost-efficient function resource provisioning framework that minimizes monetary cost while guaranteeing the performance of distributed DNN (DDNN) training workloads on serverless platforms.

Overview of λDNN

The λDNN framework runs on AWS Lambda and comprises two modules: a training performance predictor and a function resource provisioner. Using the predicted performance, the resource provisioner identifies the cost-efficient serverless function resource provisioning plan that meets the target DDNN training time. Once this plan is determined, the function allocator sets up the corresponding number of functions, each with an appropriate amount of memory.
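The provisioning step can be illustrated with a minimal sketch. This is not the authors' algorithm: it simply brute-forces over candidate function counts and memory sizes. The names `cheapest_plan` and `toy_predict_time` are hypothetical, the toy time model is made up to stand in for the training performance predictor, and the per-GB-second price is a placeholder rather than an actual AWS Lambda rate.

```python
from typing import Callable, Iterable, Optional, Tuple


def cheapest_plan(
    predict_time: Callable[[int, int], float],
    target_time_s: float,
    function_counts: Iterable[int],
    memory_sizes_mb: Iterable[int],
    price_per_gb_second: float,
) -> Optional[Tuple[int, int, float]]:
    """Return (n, memory_mb, cost) of the cheapest plan meeting the target time.

    Assumes cost = n * memory_in_GB * training_time * price_per_gb_second,
    i.e. every function is billed for the whole training run.
    Returns None when no candidate plan meets the target.
    """
    memory_sizes = list(memory_sizes_mb)  # allow re-iteration in the inner loop
    best: Optional[Tuple[int, int, float]] = None
    for n in function_counts:
        for mem_mb in memory_sizes:
            t = predict_time(n, mem_mb)
            if t > target_time_s:
                continue  # plan violates the training-time target
            cost = n * (mem_mb / 1024.0) * t * price_per_gb_second
            if best is None or cost < best[2]:
                best = (n, mem_mb, cost)
    return best


def toy_predict_time(n: int, mem_mb: int) -> float:
    """Made-up predictor: loading + computation + communication time (seconds)."""
    load = 10.0                               # fixed loading time
    comp = 6000.0 / (n * (mem_mb / 1024.0))   # shrinks with more functions/memory
    comm = 2.0 * n                            # grows with the number of functions
    return load + comp + comm


plan = cheapest_plan(
    predict_time=toy_predict_time,
    target_time_s=600.0,
    function_counts=range(1, 33),
    memory_sizes_mb=[1024, 2048, 3008],
    price_per_gb_second=1.0e-5,  # placeholder price, not an actual AWS rate
)
print(plan)
```

Passing the predictor in as a callable keeps the search independent of any particular performance model, so a fitted predictor could replace `toy_predict_time` without changing the search itself.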

Modeling DDNN Training Performance In Serverless Platforms

In general, a DNN model requires a number of iterations (denoted by k) to converge to a target training loss value. Accordingly, the DDNN training time T can be calculated by summing the loading time, the computation time, and the communication time, which is given by

The loading time is calculated as
Given n provisioned functions, the computation time t_comp of model gradients is defined as
The data communication time is calculated as
The objective is to minimize the monetary cost of provisioned function resources, while guaranteeing the performance of DDNN training workloads. The optimization problem is formally defined as

Publication

Fei Xu, Yiling Qin, Li Chen, Zhi Zhou, Fangming Liu, “λDNN: Achieving Predictable Distributed DNN Training with Serverless Architectures,” IEEE Transactions on Computers, 2022, 71(2): 450-463. DOI:10.1109/TC.2021.3054656.

@article{xu2021lambdadnn,
  title={{$\lambda$DNN}: Achieving Predictable Distributed {DNN} Training with Serverless Architectures},
  author={Xu, Fei and Qin, Yiling and Chen, Li and Zhou, Zhi and Liu, Fangming},
  journal={IEEE Transactions on Computers},
  volume={71},
  number={2},
  pages={450--463},
  year={2022},
  publisher={IEEE}
}