
Question about failing collective allreduce on a non-homogeneous ring (NVIDIA and AMD GPU) #1043

Closed
RafalSiwek opened this issue Oct 31, 2024 · 6 comments

@RafalSiwek

Dear UCC Team,

I'm working on a proof-of-concept (PoC) setup for collective operations with a mixed environment of NVIDIA and AMD GPUs on AWS, specifically using g4ad.xlarge (AMD Radeon Pro V520 with ROCm) and g4dn.xlarge (NVIDIA T4 with CUDA) instances. The goal is to enable distributed PyTorch jobs in OCI containers across these heterogeneous GPUs, using MPI with UCC and UCX for communication.

(I've set up a repository with all relevant code, log outputs, and observations to provide additional context if needed: https://github.com/RafalSiwek/troubleshoot-heterogenous-distributed-operations.)

Issue Summary

I've encountered consistent failures when attempting collective communication, specifically with the allreduce operation. Here’s a summary of my PoC setup and the steps I’ve tried so far:

  • Infra Setup:

    • AMD g4ad.xlarge instance running ROCm 6.2.2.
    • NVIDIA g4dn.xlarge instance running CUDA 12.4.
    • MPI with UCC and UCX as the transport layer.
    • Docker images for each GPU type, configured with UCX and UCC builds (see the paragraph describing the Dockerfiles with installation scripts and config outputs for UCC and UCX here).
  • Observed Behavior:

    • Simple send_recv operations between nodes succeed. (log output)
    • allreduce consistently fails on the ROCm node with errors in the ucp_mem_type_unpack function; a minimal sketch of the allreduce test is shown after this list. (log output with backtrace)
    • Setting UCX to use only the TCP layer with -mca pml_ucx_tls=tcp didn’t resolve the issue. (log output with backtrace)
    • I also tried setting -x UCC_TL_UCP_TUNE=inf based on UCC Issue #1039, which allowed the CUDA node to complete the operation but not the ROCm node. (log output with backtrace)
    • I was not able to identify the root cause in the UCX debug logs. (log output)
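
For context, the failing allreduce corresponds to a test along these lines. This is a simplified sketch assuming a PyTorch build with the MPI backend and one GPU per rank; the exact test code and Dockerfiles are in the linked repository:

import torch
import torch.distributed as dist

# Launched with mpirun -np 2 across the two nodes. Each rank uses its local GPU
# (CUDA on the NVIDIA node, ROCm/HIP on the AMD node; both expose the "cuda"
# device name in PyTorch).
dist.init_process_group(backend="mpi")
rank = dist.get_rank()
device = torch.device("cuda", 0)

tensor = torch.ones(16, device=device) * (rank + 1)
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)  # this is where the ROCm rank fails
print(f"rank {rank}: {tensor[:4]}")

dist.destroy_process_group()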

Request for Guidance

I'm hoping to understand if there's a fundamental incompatibility with UCC when handling collective operations across non-homogeneous GPU environments, or if there’s a configuration adjustment that could help stabilize this setup for the PoC.

Could you please provide insights into:

  1. Whether the current UCC architecture supports collective communication across this setup.
  2. Potential configuration changes or adjustments within UCC or UCX that could help with this mixed setup.

I’d be grateful for any guidance or advice that could help me understand if this setup is feasible with UCC, or if there’s another approach I should consider for this PoC.

Thank you very much for your time and insights!

@Sergei-Lebedev
Contributor

Hi @RafalSiwek
From the logs I don't see anything wrong on the UCC side, and the tests look good, but for some reason sending data from the CUDA node to the ROCm node fails. To add more details, could you please run the following tests?

  1. Bidirectional send-recv test: rank 0 (CUDA) sends data to and receives data from rank 1 (ROCm).
  2. Check that allreduce works with CUDA only, i.e. rank 0 and rank 1 both use a CUDA device.
  3. Check that allreduce works with ROCm only.

Please collect logs for these tests with UCC_LOG_LEVEL=debug and UCC_COLL_TRACE=debug.
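
For example, the variables can be passed through mpirun along these lines (the host names and test binary below are placeholders):

mpirun -np 2 --host cuda-node,rocm-node \
    -x UCC_LOG_LEVEL=debug -x UCC_COLL_TRACE=debug \
    ./allreduce_test
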
cc @edgargabriel

@RafalSiwek
Author

Hi @Sergei-Lebedev and thanks for the quick reply!

I ran the additional tests as requested and gathered logs for each one with UCC_LOG_LEVEL=debug and UCC_COLL_TRACE=debug. Here are the results:

Bidirectional Send_Recv

During this test, an error occurred on the ROCm node in the ucp_mem_type_unpack method. The log output with the backtrace is available here: Bidirectional Send_Recv Log with Debug.
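
For reference, the bidirectional test follows roughly this pattern; a simplified sketch assuming torch.distributed with the MPI backend (the exact code is in the repository):

import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")
rank = dist.get_rank()
peer = 1 - rank  # rank 0 (CUDA) pairs with rank 1 (ROCm)
device = torch.device("cuda", 0)

send_buf = torch.full((16,), float(rank), device=device)
recv_buf = torch.empty(16, device=device)

# Post both directions as non-blocking operations to avoid a send/send deadlock.
reqs = [dist.isend(send_buf, dst=peer), dist.irecv(recv_buf, src=peer)]
for req in reqs:
    req.wait()

print(f"rank {rank} received from rank {peer}: {recv_buf[:4]}")
dist.destroy_process_group()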

Additional Tests Summary

I also confirmed that:

  • Allreduce with CUDA-only was successful.
  • Allreduce with ROCm-only resulted in the same type of error on the ROCm node/rank, this time in the uct_rocm_copy_ep_put_short function.

To gather more details about the ROCm issue, I ran additional jobs to capture UCX debug logs (UCX logs with -x UCX_LOG_LEVEL=DEBUG here and logs with -mca pml_ucx_verbose 100 here).

To verify that the issue is specific to ROCm-only ring communication, I also tested simple send_recv operations. These showed the same error (logs with -x UCX_LOG_LEVEL=DEBUG here and logs with -mca pml_ucx_verbose 100 here). In contrast, the send_recv test on the CUDA-only ring completed without issues.

Let me know if there’s anything else you’d like me to try or if additional configurations might help with troubleshooting. Thank you again for your time and help with this!

@edgargabriel
Contributor

Can you please confirm that Large BAR support is enabled on the system with the AMD GPU? This error typically occurs when Large BAR has not been set up correctly on the system. See also https://github.com/openucx/ucx/wiki/Build-and-run-ROCM-UCX-OpenMPI#Sanity-Check-for-Large-BAR-setting

There is a way to circumvent that as well, but it requires a number of environment variables for UCX, UCC, and Open MPI, and performance will be suboptimal in that case.

@RafalSiwek
Author

I ran the Large BAR check on the AMD GPU instance, and it appears that Large BAR support is indeed not enabled on this AWS EC2 machine:

$ ./check_large_bar
address buf 0x780061c00000
Segmentation fault (core dumped)

As far as I know, AWS doesn’t allow direct access to BIOS settings, so I’m unable to enable Large BAR support from my end. If there’s an alternative workaround, even if it requires setting additional environment variables for UCX, UCC, and Open MPI, I’d be glad to try it out, understanding that it may impact performance.

Thanks again for the guidance!

@edgargabriel
Contributor

OK, in that case try setting the following environment variables; this should get around the Large BAR requirement.

UCX_ROCM_COPY_D2H_THRESH=0
UCX_ROCM_COPY_H2D_THRESH=0
UCC_EC_ROCM_REDUCE_HOST_LIMIT=0
UCC_EC_ROCM_COPY_HOST_LIMIT=0
OMPI_MCA_mpi_accelerator_rocm_memcpyD2H_limit=0
OMPI_MCA_mpi_accelerator_rocm_memcpyH2D_limit=0

(the last two are only required if you use Open MPI version 5 with ROCm support)
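
For reference, these can either be exported on both nodes before launching or passed through mpirun -x, roughly like this (host names and the test binary are placeholders):

mpirun -np 2 --host cuda-node,rocm-node \
    -x UCX_ROCM_COPY_D2H_THRESH=0 -x UCX_ROCM_COPY_H2D_THRESH=0 \
    -x UCC_EC_ROCM_REDUCE_HOST_LIMIT=0 -x UCC_EC_ROCM_COPY_HOST_LIMIT=0 \
    -x OMPI_MCA_mpi_accelerator_rocm_memcpyD2H_limit=0 \
    -x OMPI_MCA_mpi_accelerator_rocm_memcpyH2D_limit=0 \
    ./allreduce_test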

Do you know by any chance what GPUs they offer? I don't think they have Instinct GPUs, and just want to make sure that they are actually supported by the ROCm software stack.

@RafalSiwek
Author

Worked like a charm—thanks!

Do you know by any chance what GPUs they offer? I don't think they have Instinct GPUs, and just want to make sure that they are actually supported by the ROCm software stack.

Yes, for the g4ad-* instances, AWS provides AMD Radeon Pro V520 GPUs (RDNA1 architecture with the gfx1011 shader ISA). While ROCm doesn't officially support this GPU, I've managed to build the algebra libraries from source to make it work (feel free to check out my approach here: https://github.com/RafalSiwek/troubleshoot-heterogenous-distributed-operations?tab=readme-ov-file#collective-operations-environment).

I know this isn’t an ideal solution for production, and it’s intended just as a proof of concept on my end. I’ll be moving forward with testing distributed PyTorch jobs on this setup, so for now, my issue is resolved. Thanks again for all your help!
