
Multi-node Training Issue #23

Open
Amshaker opened this issue Nov 25, 2024 · 1 comment

Comments

@Amshaker

Hi @xiaoqian-shen,

Thank you for your great work on this project and for sharing the code!

I have successfully run the code on a single node with 8 GPUs. However, when attempting to scale to multi-node training, I ran into an issue: the current longvu/train.py does not seem to fully support multi-node training.

I am trying to run multi-node training, and the .sh script should support that:

CUDA_LAUNCH_BLOCKING=1 TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nproc_per_node=8 --nnodes=8 \
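
A complete per-node invocation would presumably look something like the sketch below; the master address, port, per-node rank, and trailing training arguments are placeholders, not taken from the original script:

# Run once on every node; --node_rank must be 0..7, different on each node
CUDA_LAUNCH_BLOCKING=1 TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun \
    --nproc_per_node=8 --nnodes=8 \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR --master_port=29500 \
    longvu/train.py ...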

However, the run fails with an NCCL "Duplicate GPU detected" error at this line:

torch.distributed.barrier()


Could you please advise on any changes that need to be made either to the .sh script or the train.py code to fully support multi-node training? Any guidance on resolving the NCCL Duplicate GPU error would be greatly appreciated.

Thank you!

@xiaoqian-shen
Collaborator

Our code supports multi-node training; we have trained on our company machines with 8x8 H100 GPUs. Please check whether the duplication issue is caused by running out of memory, as in the post here, or by something else.
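
One common way to gather more detail, assuming a standard torchrun/NCCL setup, is to enable NCCL logging and watch GPU memory on each node during startup; the commands below are generic debugging steps, not part of this repo, and the "..." stands for the usual launch arguments:

# Verbose NCCL logging shows which CUDA device each rank binds to,
# which makes a duplicate-device assignment visible in the logs
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT torchrun ... longvu/train.py ...

# Watch per-GPU memory on each node to rule out out-of-memory during startup
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1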
