
🐛[BUG]: Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. #691

Open
wlu1998 opened this issue Oct 16, 2024 · 8 comments
Labels: ? - Needs Triage, bug

Comments


wlu1998 commented Oct 16, 2024

Version

24.07

On which installation method(s) does this occur?

Docker

Describe the issue

When I run train_graphcast on a single node with multiple GPUs, I encounter the following error message:
[rank3]:[E1016 03:33:45.317536384 ProcessGroupNCCL.cpp:568] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
[rank3]:[E1016 03:33:45.319844411 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank3]:[E1016 03:33:45.319862434 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 3] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank3]:[E1016 03:33:45.319870659 ProcessGroupNCCL.cpp:582] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1016 03:33:45.380539738 ProcessGroupNCCL.cpp:568] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out.
[rank2]:[E1016 03:33:45.382855328 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank2]:[E1016 03:33:45.382875465 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 2] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank2]:[E1016 03:33:45.382884031 ProcessGroupNCCL.cpp:582] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1016 03:33:45.400176821 ProcessGroupNCCL.cpp:568] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
[rank1]:[E1016 03:33:45.401444623 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank1]:[E1016 03:33:45.401455763 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 1] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank1]:[E1016 03:33:45.401460371 ProcessGroupNCCL.cpp:582] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1016 03:34:00.349118707 ProcessGroupNCCL.cpp:568] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLTOALL, NumelIn=50507776, NumelOut=50507776, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.
[rank0]:[E1016 03:34:00.351482167 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank0]:[E1016 03:34:00.351502204 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 0] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank0]:[E1016 03:34:00.351511431 ProcessGroupNCCL.cpp:582] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1016 03:52:46.582729551 ProcessGroupNCCL.cpp:1304] [PG 0 Rank 0] First PG on this rank that detected no heartbeat of its watchdog.
[rank0]:[E1016 03:52:46.582797502 ProcessGroupNCCL.cpp:1342] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0
[rank0]:[F1016 04:02:46.583303966 ProcessGroupNCCL.cpp:1168] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 0
I set the following environment variables:
export NCCL_IB_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
but this didn't solve the problem.
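For context, the 600000 ms value in the log is the collective timeout of the NCCL process group, and the log message itself points at TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC for the watchdog. Below is a minimal sketch of environment variables commonly used to diagnose or relax these timeouts; the values are placeholders, not settings taken from the GraphCast example.

```bash
# Illustrative diagnostic/timeout settings; values are placeholders.
export NCCL_DEBUG=INFO                        # log NCCL setup and transport details to locate the stuck collective
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=3600  # raise the watchdog heartbeat timeout mentioned in the log
export NCCL_P2P_DISABLE=1                     # rule out GPU peer-to-peer transport as the cause of the hang
```

The 600000 ms collective timeout itself is typically the `timeout` argument passed when the process group is initialized (e.g. via `torch.distributed.init_process_group`), so raising it requires a change in the training code rather than an environment variable.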

Minimum reproducible example

No response

Relevant log output

No response

Environment details

docker run -dit --restart always --shm-size 64G --ulimit memlock=-1 --ulimit stack=67108864 --runtime nvidia -v /data/modulus:/data/modulus --name modulus --gpus all nvcr.io/nvidia/modulus/modulus:24.07 /bin/bash

wlu1998 added the ? - Needs Triage and bug labels on Oct 16, 2024
mnabian (Collaborator) commented Oct 17, 2024

Hi @wlu1998, have you tried running this on a single GPU?
Also, can you post the output of the nvidia-smi command here?

wlu1998 (Author) commented Oct 17, 2024

Due to device limitations, running on a single GPU may result in “out of memory” issues.
[attached image]

mnabian (Collaborator) commented Oct 17, 2024

For the default configs, you need a GPU with at least 80 GB of memory.
Also, GraphCast only supports distributed data parallelism, so the per-GPU memory requirement does not go down when you run on multiple GPUs.
I suggest trying to get this working on a single GPU for now.
You can reduce the size of the model and its memory overhead by changing some configs such as processor_layers, hidden_dim, and mesh_level: https://github.com/NVIDIA/modulus/blob/main/examples/weather/graphcast/conf/config.yaml#L29
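As an illustrative sketch of that suggestion, assuming the example's train_graphcast script accepts Hydra-style command-line overrides for the linked config.yaml (the script name and the values below are placeholders, not recommended settings):

```bash
# Hypothetical Hydra-style overrides to shrink the model; key names follow
# the linked config.yaml, values are placeholders for illustration only.
python train_graphcast.py processor_layers=8 hidden_dim=256 mesh_level=5
```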

wlu1998 (Author) commented Oct 18, 2024

My device is 24 GB × 4, so I can only run with multiple GPUs, and I want to solve the "Some NCCL operations have failed or timed out" issue.
Thank you for the suggestion, I will try it. Actually, I have already tried switching from the annual data to the daily data.

ram-cherukuri (Collaborator) commented:

@wlu1998 If this issue is resolved, let us know so we can close the issue.

wlu1998 (Author) commented Nov 18, 2024

> @wlu1998 If this issue is resolved, let us know so we can close the issue.

resolved


NicholasCao commented Nov 27, 2024

> @wlu1998 If this issue is resolved, let us know so we can close the issue.
>
> resolved

Help! Could you tell me how you resolved it?

wlu1998 (Author) commented Nov 29, 2024

> @wlu1998 If this issue is resolved, let us know so we can close the issue.
>
> resolved
>
> Help! Could you tell me how you resolved it?

I scaled the model architecture down and debugged on a single GPU. The NCCL problem itself was never actually solved, though; then, during one multi-GPU run, it suddenly worked, without my having changed anything.
