[ASoC 2022] Implement native PyTorch elastic training based on the torch-elastic protocol. #251
Labels: asoc2022 (Alibaba Summer of Code, 2022), community (Community discussions), enhancement (New feature or request)
Background:
As introduced on the official portal, torch-elastic has been upstreamed into PyTorch >= 1.9. Because KubeDL manages the lifecycle of jobs and orchestrates their resources, it is critical that it implement the torch-elastic distributed training protocol and provide a fault-tolerant, elastic training experience. With that in place, job completion time (JCT) can be significantly shortened while resources (CPU, memory, and GPUs) are better utilized.
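To make the fault-tolerance point concrete, here is a minimal worker-side sketch, not KubeDL code. It assumes the worker is started by `torchrun`, which sets `RANK`, `WORLD_SIZE`, and `TORCHELASTIC_RESTART_COUNT` for each process. The helper names (`read_elastic_env`, `load_step`, `save_step`, `train`) and the JSON checkpoint are hypothetical stand-ins for a real training loop and model checkpoint.

```python
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class ElasticEnv:
    rank: int
    world_size: int
    restart_count: int


def read_elastic_env(environ: Mapping[str, str] = os.environ) -> ElasticEnv:
    # These variables are set by torchrun for each worker; the defaults
    # only exist so the sketch also runs outside an elastic launcher.
    return ElasticEnv(
        rank=int(environ.get("RANK", "0")),
        world_size=int(environ.get("WORLD_SIZE", "1")),
        restart_count=int(environ.get("TORCHELASTIC_RESTART_COUNT", "0")),
    )


def load_step(path: str) -> int:
    # Hypothetical checkpoint format: a JSON file holding the last finished step.
    try:
        with open(path) as f:
            return json.load(f)["step"]
    except FileNotFoundError:
        return 0


def save_step(path: str, step: int) -> None:
    # Write to a temp file, then rename, so a crash never leaves a torn checkpoint.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)


def train(path: str, total_steps: int, env: ElasticEnv,
          fail_at: Optional[int] = None) -> int:
    # Always resume from the last checkpoint: after a restart or a change in
    # membership, every worker re-enters here with fresh RANK/WORLD_SIZE values.
    step = load_step(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated worker failure")
        step += 1  # placeholder for one real training step
        if env.rank == 0:
            save_step(path, step)
    return step
```

Because progress is reloaded on every start, a restarted worker group only repeats the work done since the last checkpoint, which is where the JCT saving comes from.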
Goals to be achieved:
Additional context:
This issue is part of our ASoC 2022 Program.
Difficulty: Normal
Mentor: Qiukai Chen (@SimonCqk)