
Question about failing collective allreduce on a non-homogeneous ring (NVIDIA and AMD GPU) #1043

Closed
RafalSiwek opened this issue Oct 31, 2024 · 6 comments

@RafalSiwek

Dear UCC Team,

I'm working on a proof-of-concept (PoC) setup for collective operations with a mixed environment of NVIDIA and AMD GPUs on AWS, specifically using g4ad.xlarge (AMD Radeon Pro V520 with ROCm) and g4dn.xlarge (NVIDIA T4 with CUDA) instances. The goal is to enable distributed PyTorch jobs in OCI containers across these heterogeneous GPUs, using MPI with UCC and UCX for communication.

(I've set up a repository with all relevant code, log outputs, and observations to provide additional context if needed: https://github.com/RafalSiwek/troubleshoot-heterogenous-distributed-operations.)

Issue Summary

I've encountered consistent failures when attempting collective communication, specifically with the allreduce operation. Here’s a summary of my PoC setup and the steps I’ve tried so far:

  • Infra Setup:

    • AMD g4ad.xlarge instance running ROCm 6.2.2.
    • NVIDIA g4dn.xlarge instance running CUDA 12.4.
    • MPI with UCC and UCX as the transport layer.
    • Docker images for each GPU type, configured with UCX and UCC builds (see the paragraph describing the Dockerfiles with installation scripts and config outputs for UCC and UCX here).
  • Observed Behavior:

    • Simple send_recv operations between nodes succeed. (log output)
    • allreduce consistently fails on the ROCm node with errors in the ucp_mem_type_unpack function; a minimal sketch of the allreduce test is shown after this list. (log output with backtrace)
    • Setting UCX to use only the TCP layer with -mca pml_ucx_tls=tcp didn’t resolve the issue. (log output with backtrace)
    • I also tried setting -x UCC_TL_UCP_TUNE=inf based on UCC Issue #1039, which allowed the CUDA node to complete the operation but not the ROCm node. (log output with backtrace)
    • I was not able to identify the root cause in the UCX debug logs. (log output)
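
For context, the failing allreduce corresponds to a test along these lines. This is a simplified sketch assuming a PyTorch build with the MPI backend and one GPU per rank; the exact test code and Dockerfiles are in the linked repository:

import torch
import torch.distributed as dist

# Launched with mpirun -np 2 across the two nodes. Each rank uses its local GPU
# (CUDA on the NVIDIA node, ROCm/HIP on the AMD node; both expose the "cuda"
# device name in PyTorch).
dist.init_process_group(backend="mpi")
rank = dist.get_rank()
device = torch.device("cuda", 0)

tensor = torch.ones(16, device=device) * (rank + 1)
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)  # this is where the ROCm rank fails
print(f"rank {rank}: {tensor[:4]}")

dist.destroy_process_group()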

Request for Guidance

I'm hoping to understand if there's a fundamental incompatibility with UCC when handling collective operations across non-homogeneous GPU environments, or if there’s a configuration adjustment that could help stabilize this setup for the PoC.

Could you please provide insights into:

  1. Whether the current UCC architecture supports collective communication across this setup.
  2. Potential configuration changes or adjustments within UCC or UCX that could help with this mixed setup.

I’d be grateful for any guidance or advice that could help me understand if this setup is feasible with UCC, or if there’s another approach I should consider for this PoC.

Thank you very much for your time and insights!

@Sergei-Lebedev
Contributor

Hi @RafalSiwek
From the logs I don't see anything wrong on the UCC side, and the tests look good, but for some reason sending data from the CUDA node to the ROCm node fails. To add more details, could you please run the following tests?

  1. Bidirectional send-recv test: rank 0 (CUDA) sends data to and receives data from rank 1 (ROCm).
  2. Check that allreduce works with CUDA only, i.e. rank 0 and rank 1 both use a CUDA device.
  3. Check that allreduce works with ROCm only.

Please collect logs for these tests with UCC_LOG_LEVEL=debug and UCC_COLL_TRACE=debug.
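
For example, the variables can be passed through mpirun along these lines (the host names and test binary below are placeholders):

mpirun -np 2 --host cuda-node,rocm-node \
    -x UCC_LOG_LEVEL=debug -x UCC_COLL_TRACE=debug \
    ./allreduce_test
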
cc @edgargabriel

@RafalSiwek
Author

Hi @Sergei-Lebedev and thanks for the quick reply!

I ran the additional tests as requested and gathered logs for each one with UCC_LOG_LEVEL=debug and UCC_COLL_TRACE=debug. Here are the results:

Bidirectional Send_Recv

During this test, an error occurred on the ROCm node in the ucp_mem_type_unpack method. The log output with the backtrace is available here: Bidirectional Send_Recv Log with Debug.
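
For reference, the bidirectional test follows roughly this pattern; a simplified sketch assuming torch.distributed with the MPI backend (the exact code is in the repository):

import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")
rank = dist.get_rank()
peer = 1 - rank  # rank 0 (CUDA) pairs with rank 1 (ROCm)
device = torch.device("cuda", 0)

send_buf = torch.full((16,), float(rank), device=device)
recv_buf = torch.empty(16, device=device)

# Post both directions as non-blocking operations to avoid a send/send deadlock.
reqs = [dist.isend(send_buf, dst=peer), dist.irecv(recv_buf, src=peer)]
for req in reqs:
    req.wait()

print(f"rank {rank} received from rank {peer}: {recv_buf[:4]}")
dist.destroy_process_group()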

Additional Tests Summary

I also confirmed that:

  • Allreduce with CUDA-only was successful.
  • Allreduce with ROCm-only resulted in the same type of error on the ROCm node/rank, this time in the uct_rocm_copy_ep_put_short function.

To gather more details about the ROCm issue, I ran additional jobs to capture UCX debug logs (UCX logs with -x UCX_LOG_LEVEL=DEBUG here and logs with -mca pml_ucx_verbose 100 here).

To verify that the issue is specific to ROCm-only ring communication, I also tested simple send_recv operations. These showed the same error (logs with -x UCX_LOG_LEVEL=DEBUG here and logs with -mca pml_ucx_verbose 100 here). In contrast, the send_recv test on the CUDA-only ring completed without issues.

Let me know if there’s anything else you’d like me to try or if additional configurations might help with troubleshooting. Thank you again for your time and help with this!

@edgargabriel
Contributor

Can you please confirm that Large BAR support is enabled on the system with the AMD GPU? This error typically occurs when Large BAR has not been set up correctly on the system. See also https://github.com/openucx/ucx/wiki/Build-and-run-ROCM-UCX-OpenMPI#Sanity-Check-for-Large-BAR-setting

There is a way to circumvent that as well, but it requires a number of environment variables for UCX, UCC, and Open MPI, and performance will be suboptimal in that case.

@RafalSiwek
Author

I ran the Large BAR check on the AMD GPU instance, and it appears that Large BAR support is indeed not enabled on this AWS EC2 machine:

$ ./check_large_bar
address buf 0x780061c00000
Segmentation fault (core dumped)

As far as I know, AWS doesn’t allow direct access to BIOS settings, so I’m unable to enable Large BAR support from my end. If there’s an alternative workaround, even if it requires setting additional environment variables for UCX, UCC, and Open MPI, I’d be glad to try it out, understanding that it may impact performance.

Thanks again for the guidance!

@edgargabriel
Contributor

OK, in that case try setting the following environment variables; this should get around the Large BAR requirement.

UCX_ROCM_COPY_D2H_THRESH=0
UCX_ROCM_COPY_H2D_THRESH=0
UCC_EC_ROCM_REDUCE_HOST_LIMIT=0
UCC_EC_ROCM_COPY_HOST_LIMIT=0
OMPI_MCA_mpi_accelerator_rocm_memcpyD2H_limit=0
OMPI_MCA_mpi_accelerator_rocm_memcpyH2D_limit=0

(the last two are only required if you use Open MPI version 5 with ROCm support)
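
For reference, these can either be exported on both nodes before launching or passed through mpirun -x, roughly like this (host names and the test binary are placeholders):

mpirun -np 2 --host cuda-node,rocm-node \
    -x UCX_ROCM_COPY_D2H_THRESH=0 -x UCX_ROCM_COPY_H2D_THRESH=0 \
    -x UCC_EC_ROCM_REDUCE_HOST_LIMIT=0 -x UCC_EC_ROCM_COPY_HOST_LIMIT=0 \
    -x OMPI_MCA_mpi_accelerator_rocm_memcpyD2H_limit=0 \
    -x OMPI_MCA_mpi_accelerator_rocm_memcpyH2D_limit=0 \
    ./allreduce_test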

Do you know by any chance what GPUs they offer? I don't think they have Instinct GPUs, and just want to make sure that they are actually supported by the ROCm software stack.

@RafalSiwek
Author

Worked like a charm—thanks!

Do you know by any chance what GPUs they offer? I don't think they have Instinct GPUs, and just want to make sure that they are actually supported by the ROCm software stack.

Yes, for the g4ad-* instances, AWS provides AMD Radeon Pro V520 GPUs (RDNA1 architecture with the gfx1011 shader ISA). While ROCm doesn't officially support this GPU, I've managed to build the algebra libraries from source to make it work (feel free to check out my approach here: https://github.com/RafalSiwek/troubleshoot-heterogenous-distributed-operations?tab=readme-ov-file#collective-operations-environment).

I know this isn’t an ideal solution for production, and it’s intended just as a proof of concept on my end. I’ll be moving forward with testing distributed PyTorch jobs on this setup, so for now, my issue is resolved. Thanks again for all your help!
