(GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE on pm-gpu after kokkos 4.2 PR #6422

Closed
ndkeen opened this issue May 16, 2024 · 2 comments · Fixed by #6423
Labels
kokkos Machine Files pm-gpu Perlmutter machine at NERSC (GPU nodes)

Comments

ndkeen commented May 16, 2024

After #6101, which brings in Kokkos 4.2, we see a runtime error with a test like:
ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-gpu_gnugpu
(also on the similar machine muller-gpu: ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.muller-gpu_gnugpu)

0: (GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 97

A fix that seems to work is to add this build flag:
Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF
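
For reference, a minimal sketch of how the flag might be passed in a machine's CMake macro file. The variable name `KOKKOS_OPTIONS` is an assumption about how the machine files collect Kokkos configure options; it is not taken from this issue, so substitute whatever your build actually uses. Going by its name, the flag turns off Kokkos' use of `cudaMallocAsync` for CUDA allocations.

```cmake
# Assumption: KOKKOS_OPTIONS is the string of -D options forwarded to the
# Kokkos configure step for this machine/compiler combination.
# Append the workaround flag reported in this issue.
string(APPEND KOKKOS_OPTIONS " -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF")
```

For a standalone Kokkos build, the same option can be given directly on the `cmake` command line as `-DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF`.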

This will also be needed for the scream repo.
@bartgol @mahf708

bartgol commented May 16, 2024

I just opened a quick follow-up PR to clean up some deprecated-code issues (they did not cause failures in E3SM because deprecated code in Kokkos was allowed).

I can add a fix for this in that PR.

ndkeen commented May 16, 2024

Fine with me. Just include the change to make the muller-gpu file the same as pm-gpu.

ndkeen added a commit that referenced this issue May 20, 2024
…ASYNC=OFF' into next (PR #6423)

After #6101, which brings in Kokkos 4.2, a test like:
ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-gpu_gnugpu

hits a runtime error like:

0: (GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 97
Unfortunately, the tests hitting this error also hang.

A fix that seems to work is to add this build flag:
-DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF

The fix for pm-gpu is merged in another PR, so this PR just makes the same change to muller-gpu.

Fixes #6422