Add MNIST test to run multi-node distributed training using KFTO #295
1c0a831 to d2e2371
Is there any specific reason not to use the existing hf_llm_training.py script, which is used in the single-node KFTO tests? I think multi-node test coverage can be achieved by parameterizing the existing tests to run on multiple nodes.
@ChughShilpa |
5f7ff0f to c095436
…VIDIA-CUDA/AMD-ROCm GPUs
c095436 to 9ae3ffd
kfto-mnist-vwcfj-master-0-pytorch.log
…s on different pods and to storage output model using PersistentVolumeClaim with RWX access mode
…s on different cluster GPU/CPU nodes
a325059 to 4b92b1d
/lgtm
5674d65 to 1c36d60
@abhijeet-dhumal I see you set the number of epochs to 1 in the test. IMO we should use a higher number of epochs.
@ChughShilpa
I think a higher number of epochs helps identify resource issues or memory leaks that may arise during prolonged training. It also trains the model in a more realistic scenario.
@ChughShilpa @sutaakar I have tested all the cases below. Training with:
I will update this test accordingly, thanks!
1c36d60 to afdf49b
/lgtm
CPU (backend type: GLOO): WorkerReplicas 2, Epochs 3
GPU (backend type: NCCL): WorkerReplicas 1, Epochs 3
Note: The initial pull of the training image takes approximately 10 to 12 minutes. In the image below, the TestPytorchjobMnistWithCuda test took 14m 58s to run.
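For readers unfamiliar with the CPU/GPU backend split used in these runs, here is a minimal sketch of how a training script might pick the backend and read its rank settings. It is not the MNIST script from this PR. The helper names `choose_backend` and `read_rank_info` are hypothetical, and the sketch assumes the launcher exports the usual `RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` environment variables; the fallback defaults are placeholders.

```python
import os


def choose_backend(use_gpu: bool) -> str:
    """Return a torch.distributed backend name for the device type."""
    # NCCL for GPU workers, Gloo for CPU-only workers, matching the runs above.
    return "nccl" if use_gpu else "gloo"


def read_rank_info(env=None) -> dict:
    """Read rank settings from environment variables (assumed to be set by the launcher)."""
    env = os.environ if env is None else env
    return {
        # Fallback values are illustrative defaults for a single local process.
        "rank": int(env.get("RANK", "0")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "master_addr": env.get("MASTER_ADDR", "localhost"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
    }


if __name__ == "__main__":
    info = read_rank_info()
    print(choose_backend(use_gpu=False), info)
```

In a real script the chosen name would typically be passed as the `backend` argument of `torch.distributed.init_process_group`, with the GPU check done via `torch.cuda.is_available()`.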
…to resolve fsspec/numpy package compatibility issue and added license in MNIST script
afdf49b to 7dbfdfc
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ChughShilpa, sutaakar
Description
Added an MNIST test to run KFTO multi-node distributed training using NVIDIA (CUDA) and AMD (ROCm) GPUs.
RHOAIENG-14540
RHOAIENG-14541
How Has This Been Tested?
To run the test on AMD GPUs, use the ROCm image:
quay.io/modh/ray@sha256:db667df1bc437a7b0965e8031e905d3ab04b86390d764d120e05ea5a5c18d1b4
Tested manually:
Note: The initial pull of the training image usually takes extra time.
Merge criteria: