
🐛[BUG]: Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. #691

Open
wlu1998 opened this issue Oct 16, 2024 · 8 comments
Labels: ? - Needs Triage, bug

Comments


wlu1998 commented Oct 16, 2024

Version

24.07

On which installation method(s) does this occur?

Docker

Describe the issue

When I run train_graphcast on a single node with multiple GPUs, I encounter the following error message:
[rank3]:[E1016 03:33:45.317536384 ProcessGroupNCCL.cpp:568] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
[rank3]:[E1016 03:33:45.319844411 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank3]:[E1016 03:33:45.319862434 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 3] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank3]:[E1016 03:33:45.319870659 ProcessGroupNCCL.cpp:582] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1016 03:33:45.380539738 ProcessGroupNCCL.cpp:568] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out.
[rank2]:[E1016 03:33:45.382855328 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank2]:[E1016 03:33:45.382875465 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 2] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank2]:[E1016 03:33:45.382884031 ProcessGroupNCCL.cpp:582] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1016 03:33:45.400176821 ProcessGroupNCCL.cpp:568] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
[rank1]:[E1016 03:33:45.401444623 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank1]:[E1016 03:33:45.401455763 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 1] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank1]:[E1016 03:33:45.401460371 ProcessGroupNCCL.cpp:582] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1016 03:34:00.349118707 ProcessGroupNCCL.cpp:568] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLTOALL, NumelIn=50507776, NumelOut=50507776, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.
[rank0]:[E1016 03:34:00.351482167 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank0]:[E1016 03:34:00.351502204 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 0] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank0]:[E1016 03:34:00.351511431 ProcessGroupNCCL.cpp:582] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1016 03:52:46.582729551 ProcessGroupNCCL.cpp:1304] [PG 0 Rank 0] First PG on this rank that detected no heartbeat of its watchdog.
[rank0]:[E1016 03:52:46.582797502 ProcessGroupNCCL.cpp:1342] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0
[rank0]:[F1016 04:02:46.583303966 ProcessGroupNCCL.cpp:1168] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 0
I set the following environment variables:
export NCCL_IB_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
but this didn't solve the problem.
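For context, the 600000 ms value in the log is the collective timeout of the NCCL process group, and the log message itself points at TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC for the watchdog. Below is a minimal sketch of environment variables commonly used to diagnose or relax these timeouts; the values are placeholders, not settings taken from the GraphCast example.

```bash
# Illustrative diagnostic/timeout settings; values are placeholders.
export NCCL_DEBUG=INFO                        # log NCCL setup and transport details to locate the stuck collective
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=3600  # raise the watchdog heartbeat timeout mentioned in the log
export NCCL_P2P_DISABLE=1                     # rule out GPU peer-to-peer transport as the cause of the hang
```

The 600000 ms collective timeout itself is typically the `timeout` argument passed when the process group is initialized (e.g. via `torch.distributed.init_process_group`), so raising it requires a change in the training code rather than an environment variable.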

Minimum reproducible example

No response

Relevant log output

No response

Environment details

docker run -dit --restart always --shm-size 64G --ulimit memlock=-1 --ulimit stack=67108864 --runtime nvidia -v /data/modulus:/data/modulus --name modulus --gpus all nvcr.io/nvidia/modulus/modulus:24.07 /bin/bash

wlu1998 added the ? - Needs Triage and bug labels on Oct 16, 2024
mnabian (Collaborator) commented Oct 17, 2024

Hi @wlu1998, have you tried running this on a single GPU?
Also, can you post the output of the nvidia-smi command here?

wlu1998 (Author) commented Oct 17, 2024

Due to device limitations, running on a single GPU may result in “out of memory” issues.
[attached image]

mnabian (Collaborator) commented Oct 17, 2024

For the default configs, you need a GPU with at least 80 GB of memory.
Also, GraphCast only supports distributed data parallelism, so the per-GPU memory requirement does not go down when you run on multiple GPUs.
I suggest trying to get this working on a single GPU for now.
You can reduce the size of the model and its memory overhead by changing some configs such as processor_layers, hidden_dim, and mesh_level: https://github.com/NVIDIA/modulus/blob/main/examples/weather/graphcast/conf/config.yaml#L29
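As an illustrative sketch of that suggestion, assuming the example's train_graphcast script accepts Hydra-style command-line overrides for the linked config.yaml (the script name and the values below are placeholders, not recommended settings):

```bash
# Hypothetical Hydra-style overrides to shrink the model; key names follow
# the linked config.yaml, values are placeholders for illustration only.
python train_graphcast.py processor_layers=8 hidden_dim=256 mesh_level=5
```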

wlu1998 (Author) commented Oct 18, 2024

My device is 24 GB × 4, so I can only run with multiple GPUs, and I want to solve the "Some NCCL operations have failed or timed out" issue.
Thank you for the suggestion, I will try it. Actually, I have already tried switching from the annual data to the daily data.

ram-cherukuri (Collaborator) commented:

@wlu1998 If this issue is resolved, let us know so we can close the issue.

wlu1998 (Author) commented Nov 18, 2024

> @wlu1998 If this issue is resolved, let us know so we can close the issue.

resolved


NicholasCao commented Nov 27, 2024

> @wlu1998 If this issue is resolved, let us know so we can close the issue.
>
> resolved

Help! Could you tell me how you resolved it?

wlu1998 (Author) commented Nov 29, 2024

> @wlu1998 If this issue is resolved, let us know so we can close the issue.
>
> resolved
>
> Help! Could you tell me how you resolved it?

I scaled the model architecture down and debugged on a single GPU. The NCCL problem itself was never actually solved, though; then, during one multi-GPU run, it suddenly worked, without my having changed anything.
