Fix ROCm CI #844

Merged
merged 8 commits into from
Jun 23, 2024

Conversation

luraess
Contributor

@luraess luraess commented Jun 20, 2024

Address #841
Supersedes #839 with respect to AMDGPU compatibility

Getting ROCm CI back on track

@@ -217,6 +186,7 @@
'

echo "+++ Run tests"
export JULIA_MPI_TEST_EXCLUDE="test_allreduce.jl,test_reduce.jl,test_scan.jl"
Contributor Author

We need to investigate why these tests fail on the AMDGPU backend. Maybe @pxl-th has an idea?

Contributor Author

luraess commented Jun 20, 2024

AMDGPU tests now pass after updating the pipeline per @vchuravy's suggestion in #840 and switching to the latest OpenMPI and UCX:

OPENMPI_VER: "5.0"
OPENMPI_VER_FULL: "5.0.3"
UCX_VER: "1.17.0"
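
As a rough sketch of why both a short and a full OpenMPI version are set: source tarballs are usually addressed by release series plus full version. The URL layouts below are assumptions based on the public Open MPI and UCX release locations; the actual download commands in the pipeline are not shown in this conversation.

```shell
# Values quoted in the comment above.
OPENMPI_VER="5.0"
OPENMPI_VER_FULL="5.0.3"
UCX_VER="1.17.0"

# Assumed URL layouts (public release locations, not taken from this PR).
openmpi_url="https://download.open-mpi.org/release/open-mpi/v${OPENMPI_VER}/openmpi-${OPENMPI_VER_FULL}.tar.gz"
ucx_url="https://github.com/openucx/ucx/releases/download/v${UCX_VER}/ucx-${UCX_VER}.tar.gz"

# Print the resolved locations so a CI log shows exactly what would be fetched.
echo "$openmpi_url"
echo "$ucx_url"
```

Printing the resolved URLs before building makes it easy to spot which versions a given backend job actually used.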

The only failing tests are test_allreduce.jl, test_reduce.jl, and test_scan.jl, which I have excluded for now using the JULIA_MPI_TEST_EXCLUDE environment variable.
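
To illustrate how a comma-separated exclusion list of this kind can be applied, here is a minimal shell sketch. The real filtering is done by the package's Julia test runner, which is not shown here; `filter_tests` is a hypothetical helper, and the extra file names `test_bcast.jl` and `test_sendrecv.jl` are used only as examples.

```shell
# Hypothetical helper: print every test file that is NOT named in the
# comma-separated exclusion list passed as the first argument.
filter_tests() {
    ft_exclude="$1"
    shift
    for ft_file in "$@"; do
        # Wrap both sides in commas so only whole names match.
        case ",${ft_exclude}," in
            *",${ft_file},"*) ;;                 # excluded: skip it
            *) printf '%s\n' "$ft_file" ;;       # not excluded: keep it
        esac
    done
}

# Same value as in the pipeline diff above.
export JULIA_MPI_TEST_EXCLUDE="test_allreduce.jl,test_reduce.jl,test_scan.jl"

filter_tests "$JULIA_MPI_TEST_EXCLUDE" \
    test_allreduce.jl test_bcast.jl test_reduce.jl test_scan.jl test_sendrecv.jl
```

Wrapping the list and each candidate in commas means `test_reduce.jl` is matched only as a whole entry, not as a substring of `test_allreduce.jl`.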

CUDA tests still fail though.

@vchuravy vchuravy mentioned this pull request Jun 21, 2024
Contributor Author

luraess commented Jun 21, 2024

@vchuravy all CUDA tests are failing

Contributor Author

luraess commented Jun 22, 2024

Also, the CUDA Buildkite workflows (builds and compilation during tests) are running close to an order of magnitude slower than the ROCm ones.

@vchuravy
Member

They are passing on main https://buildkite.com/julialang/mpi-dot-jl/builds/1451

The slowdown is very weird; it looks like things slow down when jobs run in parallel on the same machine.

Contributor Author

luraess commented Jun 22, 2024

With concurrency set to 1, shouldn't the tests run serially?

Why do the tests pass on master and not here, given that you merged master into this branch?

Contributor Author

luraess commented Jun 23, 2024

> Why do the tests pass on master and not here, given that you merged master into this branch?

So, using the latest OpenMPI and UCX (as for ROCm) in the CUDA Buildkite CI causes segfaults. Rolling back to the versions used on master fixes it.

Tests now pass (with the exception of test_allreduce.jl, test_reduce.jl, and test_scan.jl for ROCm), and codecov seems to be complaining about the changes and the project.

@vchuravy vchuravy merged commit 5e6557d into master Jun 23, 2024
52 of 54 checks passed
@vchuravy vchuravy deleted the lr/rocm-ci branch June 23, 2024 17:21
3 participants