Fix ROCm CI #844
Conversation
Include recent changes from the CUDA pipeline and use the latest OpenMPI + UCX.
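For context, a minimal sketch of what "use the latest OpenMPI + UCX" can look like as a CI build step, assuming release tarballs and a ROCm install under /opt/rocm; the versions, URLs, and install prefix below are illustrative assumptions, not the ones pinned in this pipeline:

```bash
# Illustrative versions and prefix -- not the ones pinned by the pipeline.
UCX_VERSION=1.15.0
OMPI_VERSION=5.0.0
PREFIX="$HOME/deps"

# Build UCX with ROCm support so MPI can move GPU buffers through UCX.
curl -L "https://github.com/openucx/ucx/releases/download/v${UCX_VERSION}/ucx-${UCX_VERSION}.tar.gz" | tar xz
(cd "ucx-${UCX_VERSION}" && ./contrib/configure-release --prefix="$PREFIX" --with-rocm=/opt/rocm && make -j"$(nproc)" install)

# Build OpenMPI against that UCX installation.
curl -L "https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-${OMPI_VERSION}.tar.gz" | tar xz
(cd "openmpi-${OMPI_VERSION}" && ./configure --prefix="$PREFIX" --with-ucx="$PREFIX" && make -j"$(nproc)" install)

# Make the freshly built MPI visible to the test run.
export PATH="$PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$PREFIX/lib:$LD_LIBRARY_PATH"
```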
```diff
@@ -217,6 +186,7 @@
 '

 echo "+++ Run tests"
+export JULIA_MPI_TEST_EXCLUDE="test_allreduce.jl,test_reduce.jl,test_scan.jl"
```
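The exclusion variable above is read by MPI.jl's test runner as a comma-separated list of test files to skip. A minimal stand-alone sketch of the test step; the Pkg invocation here is an assumption, not copied from the pipeline:

```bash
echo "+++ Run tests"

# Skip the collective tests that currently fail on the AMDGPU backend; the
# test runner reads this comma-separated list and omits the matching files.
export JULIA_MPI_TEST_EXCLUDE="test_allreduce.jl,test_reduce.jl,test_scan.jl"

# Run the remaining test files as usual (assumed invocation).
julia --project -e 'import Pkg; Pkg.test("MPI")'
```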
Need to investigate why these tests fail on the AMDGPU backend. Maybe @pxl-th has an idea?
AMDGPU tests pass now, after updating the pipeline to @vchuravy's suggestion in #840 and using the latest OpenMPI and UCX.
The only tests still failing are the CUDA tests, though.
@vchuravy all CUDA tests are failing.
Also, the CUDA Buildkite workflows (builds and compilation during tests) are running close to an order of magnitude slower than the ROCm ones.
They are passing on main: https://buildkite.com/julialang/mpi-dot-jl/builds/1451. The slow-down is very weird; it looks like things slow down when jobs run in parallel on the same machine.
With concurrency set to 1, shouldn't the tests run serially? And why do tests pass on master but not here, given that master was merged into this branch?
So, using the latest OpenMPI and UCX (as for ROCm) in the CUDA Buildkite CI segfaults. Rolling back to the versions as on … tests now pass (with the exception of …).
Addresses #841
Supersedes #839 w.r.t. AMDGPU compat
Getting ROCm CI back on track