Our code supports multi-node training, and we have trained it on company machines with 8x8 H100 GPUs. Please check whether the duplicate GPU error is caused by running out of memory, as in the post here, or by something else.
Hi @xiaoqian-shen,
Thank you for your great work on this project and for sharing the code!
I have successfully run the code on a single node with 8 GPUs. However, when I tried to scale to multi-node training, I ran into an issue with the current longvu/train.py, which does not seem to fully support multi-node training.
I am trying to run multi-node training, and the .sh script should support it:
LongVU/scripts/train_image_qwen.sh, line 7 in 1ca4286
However, the code fails with an NCCL "Duplicate GPU detected" error at this line:
LongVU/longvu/train.py, line 825 in 1ca4286
Could you please advise on any changes that need to be made either to the .sh script or the train.py code to fully support multi-node training? Any guidance on resolving the NCCL Duplicate GPU error would be greatly appreciated.
Thank you!
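For anyone debugging the same error: NCCL reports a duplicate GPU when two ranks on the same host end up on the same CUDA device. One common cause, which may or may not be what is happening here, is that workers do not pin themselves to the device given by `LOCAL_RANK`. The sketch below is not taken from longvu/train.py. It assumes a torchrun-style launch, which exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`, and the helper names (`read_launcher_env`, `find_duplicate_gpus`, `simulate`) are hypothetical. It uses only the standard library, so it can be run without GPUs to reason about a launch configuration.

```python
import os
from collections import defaultdict


def read_launcher_env(env=None):
    """Read the rank variables that a torchrun-style launcher exports per worker."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global rank across all nodes
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # rank within this node
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


def find_duplicate_gpus(assignments):
    """Return {(host, device): [ranks]} for every GPU claimed by more than one rank.

    `assignments` is an iterable of (rank, hostname, device_index) tuples.
    """
    claimed = defaultdict(list)
    for rank, host, device in assignments:
        claimed[(host, device)].append(rank)
    return {gpu: ranks for gpu, ranks in claimed.items() if len(ranks) > 1}


def simulate(num_nodes, gpus_per_node, pick_device):
    """Build the assignment list for a hypothetical cluster.

    `pick_device(rank, local_rank)` models how each worker chooses its GPU.
    """
    assignments = []
    for node in range(num_nodes):
        for local_rank in range(gpus_per_node):
            rank = node * gpus_per_node + local_rank
            assignments.append((rank, f"node{node}", pick_device(rank, local_rank)))
    return assignments


# Every worker pins itself to LOCAL_RANK: no GPU is shared.
good = simulate(2, 8, lambda rank, local_rank: local_rank)
# No worker selects a device, so all of them land on device 0 of their node.
bad = simulate(2, 8, lambda rank, local_rank: 0)

print(find_duplicate_gpus(good))      # {}
print(len(find_duplicate_gpus(bad)))  # 2 (device 0 on each of the two nodes)
```

In a real script, the equivalent check is to log the hostname, `RANK`, `LOCAL_RANK`, and the current CUDA device on every worker, and to make sure each worker calls `torch.cuda.set_device(local_rank)` before the process group is initialized. It is also worth confirming that the number of processes per node does not exceed the number of visible GPUs.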