Update ekat to a version that has Kokkos 4.2 as submodule #6101
Your list is quite comprehensive. Crusher seems to be in bad shape, so you might need to run the EAMxx tests on Frontier. The key EAMxx v1 tests are the ERS/P and PEM ones; there aren't baselines, so the only testing is for restart/PE-layout BFBness. There is one additional set of tests you might consider running, to ensure C++/F90 BFBness of the dycore: the Homme standalone tests on Summit or Ascent. I recently added code to make it easy to run these. You can use homme/cmake/machineFiles/summit-bfb.cmake on both Summit and Ascent, and that file does the config necessary to get easy BFB ctest'ing. I usually get an interactive node (…
Great, thanks! I thought about Frontier, but I was held back by the fact that we don't have baselines there. However, ERS/P tests can still be useful, even without baselines, so I will definitely run those (if Crusher is still sick).
Keep in mind the baselines issue is true for every platform except Chrysalis for SCREAMv1 tests.
0d9dad8 to 76d3527
76d3527 to 524527d
524527d to 903e475
An update on this. I am hitting NaNs on Chrysalis, and I tracked it down to some packed scan operations. The core issue is that, when initializing the result variable of a scan op, Kokkos uses the default constructor of "ValueType". For ekat::Pack, that constructor initializes everything to NaN (to make uninitialized values easy to track). I'm discussing with the Kokkos folks why they don't use something like …
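To make the failure mode described above concrete, here is a minimal, self-contained C++ sketch. It does not use Kokkos or ekat: `ToyPack`, `inclusive_scan_from`, `scan_default_init` and `scan_zero_init` are hypothetical names made up for illustration. It only assumes what the comment states, namely that the scan accumulator is created with the value type's default constructor and that the pack's default constructor fills every lane with NaN. The zero-initialized variant is one possible remedy, not necessarily the one being discussed with the Kokkos developers.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical stand-in for a SIMD-style pack (NOT ekat::Pack itself).
// The default constructor poisons every lane with NaN so that
// uninitialized values are easy to spot.
template <int N>
struct ToyPack {
  std::array<double, N> lanes;

  ToyPack() { lanes.fill(std::numeric_limits<double>::quiet_NaN()); }
  explicit ToyPack(double v) { lanes.fill(v); }

  // Lane-wise accumulation.
  ToyPack& operator+=(const ToyPack& rhs) {
    for (int i = 0; i < N; ++i) lanes[i] += rhs.lanes[i];
    return *this;
  }

  // Bounds-checked read access to a single lane.
  double at(int i) const {
    assert(i >= 0 && i < N);
    return lanes[i];
  }
};

// Inclusive prefix sum whose accumulator starts from a caller-supplied value.
template <typename T>
std::vector<T> inclusive_scan_from(const std::vector<T>& in, T accum) {
  std::vector<T> out;
  out.reserve(in.size());
  for (const T& x : in) {
    accum += x;
    out.push_back(accum);
  }
  return out;
}

// Accumulator built with the default constructor: for ToyPack this is NaN,
// and NaN + x stays NaN, so every output entry is poisoned.
template <typename T>
std::vector<T> scan_default_init(const std::vector<T>& in) {
  return inclusive_scan_from(in, T());
}

// Accumulator built explicitly from the additive identity.
template <typename T>
std::vector<T> scan_zero_init(const std::vector<T>& in) {
  return inclusive_scan_from(in, T(0.0));
}
```

With three packs of ones as input, `scan_default_init` leaves NaN in every lane of every output entry, while `scan_zero_init` produces the expected running sums 1, 2, 3.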
903e475 to 92d62e5
92d62e5 to a2a5813
a2a5813 to 62bec2a
62bec2a to 77d3108
For SCREAMv1 compset testing on Chrysalis, the fails all look the same. The stacktrace (see below) seems to point to some sort of error during MPI initialization, which, besides being completely out of our control, is also completely independent of Kokkos.
But the scream nightlies are also getting that error, and Rob mentioned an upgrade to the Chrysalis drivers that is causing issues, with a fix being worked on by ANL folks. No need to sweat over the Chrysalis fails (yet).
@bartgol Chrysalis had some updates last week that may have caused the MPI fails. Please try your tests again.
@bartgol how is this going?
@rljacob I was out almost 2 weeks due to knee surgery. I am back now, and this is a priority on my todo list. I think I just need to check EAMxx testing on Frontier, and then we can integrate. It's a pain to test so many test suites manually, since by the time I figure out the fix for one DIFF/FAIL, some other build will fail due to master baselines being updated (forcing a rebase). So as soon as I confirm that EAMxx on Frontier is ok, I would like to merge to next, to start integration.
8794f8b to 3bc66c1
Besides doing nothing, InitArguments is deprecated in Kokkos 4.0
Kokkos 4.0 no longer allows use of volatile in this context
Some of the exec spaces static methods are no longer static
The KOKKOS_TARGET macro prevents Kokkos::initialize from being called twice in kokkos targets. Kokkos 4.2 no longer tolerates double initialization, so we must prevent it.
… is not supported
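The double-initialization commit above can be pictured with a small guard. The sketch below is plain C++ with no Kokkos dependency: `ToyRuntime` and `initialize_if_needed` are hypothetical stand-ins, and per the commit message the actual change in this PR relies on the `KOKKOS_TARGET` macro rather than a runtime check like this one. It only illustrates the general idea of making a second initialization request harmless when the underlying runtime rejects it.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical runtime that, like the behavior described for Kokkos 4.2,
// refuses to be initialized twice.
class ToyRuntime {
 public:
  static void initialize() {
    if (flag()) throw std::runtime_error("ToyRuntime: already initialized");
    flag() = true;
  }

  static void finalize() {
    assert(flag() && "finalize called before initialize");
    flag() = false;
  }

  static bool is_initialized() { return flag(); }

 private:
  // Function-local static avoids static-initialization-order problems.
  static bool& flag() {
    static bool initialized = false;
    return initialized;
  }
};

// Initializes the runtime only if nobody has done so yet.
// Returns true if this call performed the initialization.
inline bool initialize_if_needed() {
  if (ToyRuntime::is_initialized()) return false;
  ToyRuntime::initialize();
  return true;
}
```

The first call to `initialize_if_needed()` returns `true`; later calls return `false` instead of throwing, which is the property a library needs when both it and its host application may try to bring up the runtime.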
3bc66c1 to 74611c9
@rljacob I think this branch is ready for integration. Can we pipeline it? I think there were 2 diffs in total, but keeping up with rebases was a pain, so I'd like to give it a shot with next testing...
Pipeline it? GitHub says there are no conflicts.
I mean, I don't know if next is open, and/or if other PRs were already scheduled for integration. I just want this to be put in line. Pinging @jgfouca as well, since he's the assignee.
Btw, @rljacob this PR includes the mod that is pipelined in eamxx via E3SM-Project/scream#2799. Would you like to do a similar PR in E3SM first, and then integrate this PR?
No, it's ok to be in this PR.
Is this ready to merge to next?
Jim, I think we can merge to next, yes.
Update ekat to a version that has Kokkos 4.2 as submodule

This PR will take time to integrate, I'm opening it so I can keep track of what I check.

- e3sm_integration:
  - chrysalis (intel): all PASS
  - pm-cpu (intel): 127 PASS and 1 DIFF: SMS_Ld2.ne30pg2_r05_IcoswISC30E3r5.BGCEXP_CNTL_CNPECACNT_1850.pm-cpu_intel.elm-bgcexp (which is currently consistently failing in the e3sm_integration_next_intel nightly build)
- e3sm_developer:
  - pm-cpu (gnu): 75 PASS and 1 DIFF: ERP_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu
- homme_integration:
  - chrysalis (intel): PASS
  - pm-cpu (gnu): builds PASS, stuck in Q for run, so I cancelled it. pm-cpu is not tested in nightlies anyways
- eamxx testing (from eamxx repo, with a few additional commits for eamxx):
  - v1 (CIME):
    - chrysalis (intel): 10 PASS (all scream v1), 5 DIFF (all scream v0)
    - pm-cpu (gnu): 3 PASS, 1 DIFF
    - frontier: PEND
    - ascent: no longer part of eamxx nightlies
    - pm-gpu (gnugpu): 7 PASS, 5 DIFF. All DIFF are in debug mode, while all non-debug builds pass. I'm trying to understand what's the catch.
  - standalone:
    - mappy (gnu): all PASS
    - weaver (gnu+cuda): all PASS

[BFB]
Merged to next.
Update: we reverted the merge to next, since it will likely conflict with #6226. We will resume integration of this PR once that one is merged.
Reverted off of next.
Merged to next.
There are a bunch of fails on CDash for next as of May 9th. Excluding the I and G cases, which should not depend on ekat/kokkos, we have the builds listed below. As I go through the builds, I'll add an explanation of the fails next to them, and if they are not this PR's fault, I'll check them out.
pm-cpu, e3sm_integration_next_intel:
chrysalis, e3sm_integration_next_intel:
pm-cpu, e3sm_prod_next_intel:
compy, e3sm_prod_next_intel:
mappy, e3sm_developer_next_gnu:
anvil, e3sm_prod_next_intel: all the jobs seem to hit some batch scheduler issue. They either get canceled while running, or they are submitted but never produce any log in RUNDIR. It has been like this for a few days. I'm thinking it has nothing to do with this PR.
bebop, e3sm_extra_coverage_next_intel:
pm-cpu, e3sm_superbfb_next_intel:
@jgfouca @rljacob I went through the yellow boxes of the MustPass and MustPass_wBaseline builds on CDash. I only checked F cases, since from what I understand CRYO/G/I cases are not using active atm, so they are not building kokkos. For all failures I found a reason that seems to be unrelated to this PR. The only builds I can't deem as "ok" (at least from the point of view of merging this PR) are the bebop builds, since we need the new modules PR to go in for kokkos 4.2 to be happy. I am thinking that we could merge this PR as is, since the passes with Intel on other platforms make me confident we won't have many surprises once the bebop modules PR goes in (but I will of course keep an eye out, and jump in if F cases still fail due to kokkos shenanigans once that PR goes in). What are your thoughts?
Yes, it's fine to merge this without waiting for the bebop fixes.
…ASYNC=OFF' into next (PR #6423)

After #6101, which brings in kokkos 4.2, a test like ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-gpu_gnugpu hits a runtime error like:

0: (GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 97

Unfortunately, the tests hitting this error are also hanging...

A fix that seems to work is to add this build flag:

-DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF

The fix is merged in another PR for pm-gpu, so this PR just makes the same change to muller-gpu.

Fixes #6422
@ambrad @oksanaguba can you think of any more machines/test suites I should run?