Question about failing collective allreduce on a non-homogeneous ring (NVIDIA and AMD GPU) #1043
Comments
Hi @RafalSiwek
Please collect logs for these tests with `UCC_LOG_LEVEL=debug` and `UCC_COLL_TRACE=debug`.
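A minimal sketch of propagating these debug settings to every rank with Open MPI's `-x` option; the hostnames and `./test_binary` below are placeholders, not the actual setup:

```bash
# Export the UCC debug variables to all ranks and capture the combined output.
mpirun -np 2 -H cuda-node:1,rocm-node:1 \
    -x UCC_LOG_LEVEL=debug -x UCC_COLL_TRACE=debug \
    ./test_binary 2>&1 | tee ucc_debug.log
```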
Hi @Sergei-Lebedev, and thanks for the quick reply! I ran the additional tests as requested and gathered logs for each one with the suggested debug flags. During the bidirectional send_recv test, an error occurred on the ROCm node; I also confirmed a few additional points, summarized alongside the collected test logs.

To gather more details about the ROCm issue, I ran additional jobs to capture UCX debug logs. To check whether the problem is specific to ROCm-only ring communication, I also ran simple tests against that configuration.

Let me know if there's anything else you'd like me to try or if additional configurations might help with troubleshooting. Thank you again for your time and help with this!
Can you please confirm that Large BAR support is enabled on the system with the AMD GPU? This error typically occurs if Large BAR has not been set up correctly. See also https://github.com/openucx/ucx/wiki/Build-and-run-ROCM-UCX-OpenMPI#Sanity-Check-for-Large-BAR-setting

There is a way to circumvent that as well, but it requires a flurry of environment variables for UCX, UCC, and Open MPI, and performance will be suboptimal in that case.
I ran the Large BAR check on the AMD GPU instance, and it appears that Large BAR support is indeed not enabled on this AWS EC2 machine:

```
$ ./check_large_bar
address buf 0x780061c00000
Segmentation fault (core dumped)
```

As far as I know, AWS doesn't allow direct access to BIOS settings, so I'm unable to enable Large BAR support from my end. If there's an alternative workaround, even if it requires setting additional environment variables for UCX, UCC, and Open MPI, I'd be glad to try it out, understanding that it may impact performance. Thanks again for the guidance!
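For reference, a hedged way to inspect from the host whether the GPU exposes its full VRAM through a large prefetchable BAR, assuming an AMD device (PCI vendor ID 0x1002); with Large BAR enabled, one prefetchable region should span several GiB rather than a small 256 MiB window:

```bash
# List AMD PCI devices and their memory regions; check the prefetchable
# region sizes reported for the GPU.
sudo lspci -vvv -d 1002: | grep -E "^[0-9a-f]|Region"
```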
OK, in that case try to set the following environment variables; this should get around the Large BAR requirement (the last two are only required if you use Open MPI version 5 with ROCm support).

Do you know by any chance which GPUs they offer? I don't think they have Instinct GPUs, and I just want to make sure that they are actually supported by the ROCm software stack.
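A quick, hedged way to identify the GPU model and the gfx target that the ROCm stack actually sees on the AMD node, assuming the standard ROCm tools are installed:

```bash
# Print the GPU marketing name and gfx ISA target known to ROCm.
rocminfo | grep -iE "marketing name|gfx"
```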
Worked like a charm, thanks!

Yes, the g4ad.xlarge instances use the AMD Radeon Pro V520 GPU. I know this isn't an ideal solution for production, and it's intended just as a proof of concept on my end. I'll be moving forward with testing distributed PyTorch jobs on this setup, so for now my issue is resolved. Thanks again for all your help!
Dear UCC Team,
I'm working on a proof-of-concept (PoC) setup for collective operations in a mixed environment of NVIDIA and AMD GPUs on AWS, specifically using `g4ad.xlarge` (AMD Radeon Pro V520 with ROCm) and `g4dn.xlarge` (NVIDIA T4 with CUDA) instances. The goal is to enable distributed PyTorch jobs in OCI containers across these heterogeneous GPUs, using MPI with UCC and UCX for communication.

(I've set up a repository with all relevant code, log outputs, and observations to provide additional context if needed: https://github.com/RafalSiwek/troubleshoot-heterogenous-distributed-operations.)
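To make the intended launch flow concrete, here is a minimal, hedged sketch of how such a job might be started; the hostnames and `train.py` are placeholders, and the MCA settings reflect the commonly documented way to enable UCX and UCC in Open MPI rather than the exact configuration used here:

```bash
# Hypothetical two-node launch: one CUDA rank and one ROCm rank.
# train.py is assumed to initialize torch.distributed with the MPI backend.
mpirun -np 2 -H cuda-node:1,rocm-node:1 \
    --mca pml ucx \
    --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 \
    python3 train.py
```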
Issue Summary
I've encountered consistent failures when attempting collective communication, specifically with the `allreduce` operation. Here's a summary of my PoC setup and the steps I've tried so far:

Infra Setup:

- `g4ad.xlarge` instance running ROCm 6.2.2
- `g4dn.xlarge` instance running CUDA 12.4

Observed Behavior:
- `send_recv` operations between nodes succeed. (log output)
- `allreduce` consistently fails on the ROCm node with errors in the `ucp_mem_type_unpack` function. (log output with backtrace)
- Running with `-mca pml_ucx_tls=tcp` didn't resolve the issue. (log output with backtrace)
- Setting `-x UCC_TL_UCP_TUNE=inf`, based on UCC issue #1039, allowed the CUDA node to complete the operation but not the ROCm node. (log output with backtrace) A reproduction sketch for replaying these attempts follows below.
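For concreteness, a hedged sketch of a reproduction command with the debug flags enabled; the hostnames and `./allreduce_test` are placeholders for the actual reproducer in the linked repository, and the two workaround flags from the list above (`--mca pml_ucx_tls tcp`, `-x UCC_TL_UCP_TUNE=inf`) can be appended individually to replay each attempt:

```bash
# Hypothetical two-node reproduction of the failing allreduce with UCC debug
# logging enabled; append the workaround flags one at a time to compare runs.
mpirun -np 2 -H cuda-node:1,rocm-node:1 \
    --mca pml ucx --mca coll_ucc_enable 1 \
    -x UCC_LOG_LEVEL=debug -x UCC_COLL_TRACE=debug \
    ./allreduce_test
```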
Request for Guidance

I'm hoping to understand if there's a fundamental incompatibility with UCC when handling collective operations across non-homogeneous GPU environments, or if there's a configuration adjustment that could help stabilize this setup for the PoC.
Could you please provide insights into:
I’d be grateful for any guidance or advice that could help me understand if this setup is feasible with UCC, or if there’s another approach I should consider for this PoC.
Thank you very much for your time and insights!