
Multi-node Training Issue #23

Open
Amshaker opened this issue Nov 25, 2024 · 1 comment

Comments

@Amshaker

Hi @xiaoqian-shen,

Thank you for your great work on this project and for sharing the code!

I have successfully run the code on a single node with 8 GPUs. However, when attempting to scale to multi-node training, I ran into an issue: the current longvu/train.py does not seem to fully support multi-node training.

I am trying to run multi-node training, and the .sh script should support that:

CUDA_LAUNCH_BLOCKING=1 TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nproc_per_node=8 --nnodes=8 \
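
A complete per-node invocation would presumably look something like the sketch below; the master address, port, per-node rank, and trailing training arguments are placeholders, not taken from the original script:

# Run once on every node; --node_rank must be 0..7, different on each node
CUDA_LAUNCH_BLOCKING=1 TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun \
    --nproc_per_node=8 --nnodes=8 \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR --master_port=29500 \
    longvu/train.py ...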

However, the run fails with an NCCL "Duplicate GPU detected" error at this line:

torch.distributed.barrier()


Could you please advise on any changes that need to be made either to the .sh script or the train.py code to fully support multi-node training? Any guidance on resolving the NCCL Duplicate GPU error would be greatly appreciated.

Thank you!

@xiaoqian-shen
Collaborator

Our code supports multi-node training; we have trained on our company machines with 8x8 H100 GPUs. Please check whether the duplication issue is caused by running out of memory, as in the post here, or by something else.
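
One common way to gather more detail, assuming a standard torchrun/NCCL setup, is to enable NCCL logging and watch GPU memory on each node during startup; the commands below are generic debugging steps, not part of this repo, and the "..." stands for the usual launch arguments:

# Verbose NCCL logging shows which CUDA device each rank binds to,
# which makes a duplicate-device assignment visible in the logs
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT torchrun ... longvu/train.py ...

# Watch per-GPU memory on each node to rule out out-of-memory during startup
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1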
