For muller-gpu, add -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF #6423

Merged


ndkeen
Contributor

@ndkeen ndkeen commented May 16, 2024

After #6101, which brings in Kokkos 4.2, tests such as
ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-gpu_gnugpu

fail at runtime with an error like:

0: (GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 97

Unfortunately, the tests that hit this error also hang.

A fix that seems to work is to add this build flag:
-DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF

The fix is merged in another PR for pm-gpu, so this PR just makes the same change for muller-gpu.
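
As a rough illustration of what such a change could look like, here is a minimal sketch of a machine-specific CMake macro fragment. It assumes the Kokkos configure options are collected in a variable named `KOKKOS_OPTIONS`; that variable name and the placement in a muller-gpu macro file are assumptions for illustration and are not taken from this PR's diff.

```cmake
# Sketch only: hypothetical fragment of a muller-gpu CMake macro file.
# KOKKOS_OPTIONS is an assumed variable name for the flags handed to the
# Kokkos configure step; adjust to whatever the machine files actually use.

# Disable Kokkos' cudaMallocAsync-based allocation path, the workaround
# described above for the cuIpcGetMemHandle runtime error.
string(APPEND KOKKOS_OPTIONS " -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF")
```

The leading space inside the quoted string matters here, since `string(APPEND ...)` concatenates without adding a separator between existing flags and the new one.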

Fixes #6422

@ndkeen ndkeen added the Machine Files, kokkos, and pm-gpu (Perlmutter machine at NERSC, GPU nodes) labels May 16, 2024
@ndkeen ndkeen self-assigned this May 16, 2024
@ndkeen ndkeen requested review from bartgol and mahf708 May 16, 2024 03:24

github-actions bot commented May 16, 2024

PR Preview Action v1.4.7
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6423/
on branch gh-pages at 2024-05-20 16:40 UTC

Contributor

@bartgol bartgol left a comment

I added a similar commit to #6421. I don't really care which PR this fix is attached to. That one is already in next, though. Either way, I'm approving, just in case that one has issues.

@ndkeen
Contributor Author

ndkeen commented May 17, 2024

It doesn't look like you made the change to muller-gpu?

@ndkeen ndkeen changed the title from "For pm-gpu, add -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF" to "For muller-gpu, add -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF" May 20, 2024
@ndkeen ndkeen added the BFB (PR leaves answers BFB) label May 20, 2024
ndkeen added a commit that referenced this pull request May 20, 2024
…ASYNC=OFF' into next (PR #6423)

@ndkeen
Contributor Author

ndkeen commented May 20, 2024

merged to next

@ndkeen ndkeen merged commit 491ed8a into master May 21, 2024
21 checks passed
@ndkeen ndkeen deleted the ndk/machinefiles/pm-gpu-Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF branch May 21, 2024 22:21
Successfully merging this pull request may close these issues.

(GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE on pm-gpu after kokkos 4.2 PR