Fix ROCm CI #844
Conversation
Include recent changes from the CUDA pipeline and use the latest OpenMPI + UCX.
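For context, a minimal sketch of what "use the latest OpenMPI + UCX" can look like as a CI build step, assuming release tarballs and a ROCm install under /opt/rocm; the versions, URLs, and install prefix below are illustrative assumptions, not the ones pinned in this pipeline:

```bash
# Illustrative versions and prefix -- not the ones pinned by the pipeline.
UCX_VERSION=1.15.0
OMPI_VERSION=5.0.0
PREFIX="$HOME/deps"

# Build UCX with ROCm support so MPI can move GPU buffers through UCX.
curl -L "https://github.com/openucx/ucx/releases/download/v${UCX_VERSION}/ucx-${UCX_VERSION}.tar.gz" | tar xz
(cd "ucx-${UCX_VERSION}" && ./contrib/configure-release --prefix="$PREFIX" --with-rocm=/opt/rocm && make -j"$(nproc)" install)

# Build OpenMPI against that UCX installation.
curl -L "https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-${OMPI_VERSION}.tar.gz" | tar xz
(cd "openmpi-${OMPI_VERSION}" && ./configure --prefix="$PREFIX" --with-ucx="$PREFIX" && make -j"$(nproc)" install)

# Make the freshly built MPI visible to the test run.
export PATH="$PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$PREFIX/lib:$LD_LIBRARY_PATH"
```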
```diff
@@ -217,6 +186,7 @@
 '

 echo "+++ Run tests"
+export JULIA_MPI_TEST_EXCLUDE="test_allreduce.jl,test_reduce.jl,test_scan.jl"
```
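The exclusion variable above is read by MPI.jl's test runner as a comma-separated list of test files to skip. A minimal stand-alone sketch of the test step; the Pkg invocation here is an assumption, not copied from the pipeline:

```bash
echo "+++ Run tests"

# Skip the collective tests that currently fail on the AMDGPU backend; the
# test runner reads this comma-separated list and omits the matching files.
export JULIA_MPI_TEST_EXCLUDE="test_allreduce.jl,test_reduce.jl,test_scan.jl"

# Run the remaining test files as usual (assumed invocation).
julia --project -e 'import Pkg; Pkg.test("MPI")'
```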
Need to investigate why these tests fail on the AMDGPU backend. Maybe @pxl-th has an idea?
AMDGPU tests pass now, after updating the pipeline to @vchuravy's suggestion in #840 and using the latest OpenMPI and UCX.
The only tests still failing are the CUDA tests, though.
@vchuravy all CUDA tests are failing.
Also, the CUDA Buildkite workflows (builds and compilation during tests) are running close to an order of magnitude slower than the ROCm ones.
They are passing on main: https://buildkite.com/julialang/mpi-dot-jl/builds/1451. The slow-down is very weird; it looks like things slow down when jobs run in parallel on the same machine.
With concurrency set to 1, shouldn't the tests run serially? And why do tests pass on master but not here, given that master was merged into this branch?
So, using the latest OpenMPI and UCX (as for ROCm) in the CUDA Buildkite CI segfaults. Rolling back to the versions as on … tests now pass (with the exception of …).
Addresses #841
Supersedes #839 w.r.t. AMDGPU compat
Getting ROCm CI back on track