Update ekat to a version that has Kokkos 4.2 as submodule #6101

bartgol · 2023-12-05T23:51:54Z

This PR will take time to integrate, I'm opening it so I can keep track of what I check.

@ambrad @oksanaguba can you think of any more machine/testsuite I should run?

github-actions · 2023-12-05T23:53:00Z

PR Preview Action v1.4.7
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6101/
on branch `gh-pages` at 2024-04-25 18:35 UTC

ambrad · 2023-12-06T03:25:03Z

Your list is quite comprehensive. Crusher seems to be in bad shape, so you might need to run the EAMxx tests on Frontier. The key EAMxx v1 tests are the ERS/P and PEM ones; there aren't baselines, so the only testing is for restart/PE-layout BFBness.

There is one additional set of tests you might consider running, to assure C++/F90 BFBness of the dycore: the Homme standalone tests on Summit or Ascent. I recently added code to make it easy to run these. You can use homme/cmake/machineFiles/summit-bfb.cmake on both Summit and Ascent, and that file does the config necessary to get easy BFB ctest'ing. I usually get an interactive node (bsub -Is -W 0:60 -nnodes 1 -P cli115 /bin/bash), start with ctest -R _ut just to make sure there's nothing obvious the unit tests see, then proceed with the full test suite.

bartgol · 2023-12-06T17:49:49Z

Great, thanks! I thought about Frontier, but I was held back by the fact that we don't have baselines there. However, ERS/P tests can still be useful, even without baselines, so I will def run those (if crusher is still sick).

ambrad · 2023-12-06T17:51:37Z

> Great, thanks! I thought about Frontier, but I was held back by the fact that we don't have baselines there. However, ERS/P tests can still be useful, even without baselines, so I will def run those (if crusher is still sick).

Keep in mind the baselines issue is true for every platform except Chrysalis for SCREAMv1 tests.

bartgol · 2024-01-25T16:54:36Z

An update on this: I am hitting NaNs on Chrysalis, and I tracked them down to some packed scan operations. The core issue is that, when initializing the result variable of a scan op, Kokkos uses the default constructor of "ValueType". For ekat::Pack, that ctor initializes everything to NaN (to make uninitialized values easy to track). I'm discussing with the Kokkos folks why they don't use something like Kokkos::reduction_identity<ValueType>::sum(), which seems appropriate. Once I hear back from them, I'll know how best to tackle the issue (which may be "wait for Kokkos 4.3.00 or 4.2.01").
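
To make the failure mode concrete, here is a minimal sketch that does not use Kokkos or EKAT at all. `ToyPack`, `scan_default_init`, `scan_identity_init`, `any_nan`, and `demo` are hypothetical names invented for illustration: `ToyPack` only mimics a pack whose default constructor fills every lane with NaN, and the two scan functions differ only in how the running total is initialized.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for a SIMD-like pack; this is NOT the real ekat::Pack.
// Like the pack described above, the default ctor fills every lane with NaN.
struct ToyPack {
  static constexpr std::size_t N = 4;
  std::array<double, N> v;

  ToyPack() { v.fill(std::numeric_limits<double>::quiet_NaN()); }
  explicit ToyPack(double x) { v.fill(x); }

  ToyPack& operator+=(const ToyPack& o) {
    for (std::size_t i = 0; i < N; ++i) v[i] += o.v[i];
    return *this;
  }
};

// True if at least one lane of the pack is NaN.
inline bool any_nan(const ToyPack& p) {
  for (double x : p.v) {
    if (std::isnan(x)) return true;
  }
  return false;
}

// Inclusive scan whose accumulator is default-constructed: it starts
// as NaN, so every partial sum is NaN as well.
inline std::vector<ToyPack> scan_default_init(const std::vector<ToyPack>& in) {
  ToyPack acc;
  std::vector<ToyPack> out;
  for (const auto& p : in) {
    acc += p;
    out.push_back(acc);
  }
  return out;
}

// Same scan, but the accumulator starts from the additive identity (0).
inline std::vector<ToyPack> scan_identity_init(const std::vector<ToyPack>& in) {
  ToyPack acc(0.0);
  std::vector<ToyPack> out;
  for (const auto& p : in) {
    acc += p;
    out.push_back(acc);
  }
  return out;
}

// Small self-check contrasting the two initializations.
inline void demo() {
  const std::vector<ToyPack> in{ToyPack(1.0), ToyPack(2.0), ToyPack(3.0)};
  assert(any_nan(scan_default_init(in).back()));
  assert(!any_nan(scan_identity_init(in).back()));
}
```

Calling `demo()` from a test driver exercises both variants. The point of the sketch is only that a NaN-initialized accumulator poisons every partial sum, which is why an identity-based initialization looks like the natural choice for a sum scan.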

bartgol · 2024-03-14T22:35:42Z

For SCREAMv1 compset testing on Chrysalis, the failures all look the same. The stack trace (see below) seems to point to some sort of error during MPI initialization, which, besides being completely out of our control, is also completely independent of Kokkos.

25: forrtl: error (65): floating invalid
25: Image              PC                Routine            Line        Source    
25: libpnetcdf.so.3.0  000015555171E68C  for__signal_handl     Unknown  Unknown
25: libpthread-2.28.s  00001555453EFCF0  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    00001555402AA57F  ucp_proto_perf_en     Unknown  Unknown
25: libucp.so.0.0.0    00001555402AAA50  ucp_proto_init_pa     Unknown  Unknown
25: libucp.so.0.0.0    00001555402AB8EC  ucp_proto_common_     Unknown  Unknown
25: libucp.so.0.0.0    00001555402B127C  ucp_proto_multi_i     Unknown  Unknown
25: libucp.so.0.0.0    00001555402E013A  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    00001555402B1CDB  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    00001555402B2BB2  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    00001555402B2DC4  ucp_proto_select_     Unknown  Unknown
25: libucp.so.0.0.0    00001555402B39A7  ucp_proto_select_     Unknown  Unknown
25: libucp.so.0.0.0    00001555402A30D8  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    00001555402A330E  ucp_worker_get_ep     Unknown  Unknown
25: libucp.so.0.0.0    0000155540309ADD  ucp_wireup_init_l     Unknown  Unknown
25: libucp.so.0.0.0    000015554028CF75  ucp_ep_create_to_     Unknown  Unknown
25: libucp.so.0.0.0    000015554028D714  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    000015554028DB8E  ucp_ep_create         Unknown  Unknown
25: libmpi.so.40.30.3  0000155545AA7607  mca_pml_ucx_add_p     Unknown  Unknown
25: libmpi.so.40.30.3  0000155545B0D723  ompi_mpi_init         Unknown  Unknown
25: libmpi.so.40.30.3  00001555458D004D  MPI_Init              Unknown  Unknown
25: libmpi_mpifh.so.4  0000155545E729D7  PMPI_Init_f08         Unknown  Unknown
25: e3sm.exe           0000000000437E05  cime_comp_mod_mp_         708  cime_comp_mod.F90
25: e3sm.exe           0000000000499955  MAIN__                     63  cime_driver.F90
25: e3sm.exe           0000000000437D22  Unknown               Unknown  Unknown
25: libc-2.28.so       0000155545052D85  __libc_start_main     Unknown  Unknown
25: e3sm.exe           0000000000437C2E  Unknown               Unknown  Unknown

But scream nightlies are also getting that error, and Rob mentioned an upgrade to chrys drivers that is causing issues, with a fix worked on by ANL folks. No need to sweat on chrys fails (yet).

rljacob · 2024-03-21T16:59:47Z

@bartgol Chrysalis had some updates last week that may have caused the MPI fails. Please try your tests again.

rljacob · 2024-04-12T04:01:38Z

@bartgol how is this going?

bartgol · 2024-04-16T00:10:00Z

@rljacob I was out almost 2 weeks due to knee surgery. I am back now, and this is a priority on my todo list. I think I just need to check EAMxx testing on frontier, and then we can integrate. It's a pain to test so many testsuites manually, since by the time I figure out the fix for one DIFF/FAIL, some other build will fail due to master baselines being updated (forcing a rebase). So as soon as I confirm that eamxx on frontier is ok, I would like to merge to next, to start integration.

bartgol · 2024-04-25T18:38:26Z

@rljacob I think this branch is ready for integration. Can we pipeline it? I think there were 2 diffs in total, but keeping up with rebases was a pain, so I'd like to give it a shot with next testing...

rljacob · 2024-04-25T18:41:37Z

pipeline it? github says there's no conflicts.

bartgol · 2024-04-25T18:43:52Z

I mean, I don't know if next is open, and/or if other PRs were already scheduled for integration. I just want this to be put in line.

Pinging @jgfouca as well, since he's the assignee.

bartgol · 2024-04-25T18:44:58Z

Btw, @rljacob this PR includes the mod that is pipelined in eamxx via E3SM-Project/scream#2799. Would you like to do a similar PR in E3SM first, and then integrate this PR?

rljacob · 2024-04-26T04:45:42Z

No, it's OK for it to be in this PR.

jgfouca · 2024-04-29T17:15:06Z

Is this ready to merge to next?

bartgol · 2024-04-29T17:38:06Z

Jim, I think we can merge to next, yes.

Update ekat to a version that has Kokkos 4.2 as submodule

This PR will take time to integrate, I'm opening it so I can keep track of what I check.

e3sm_integration:
  chrysalis (intel): all PASS
  pm-cpu (intel): 127 PASS and 1 DIFF: SMS_Ld2.ne30pg2_r05_IcoswISC30E3r5.BGCEXP_CNTL_CNPECACNT_1850.pm-cpu_intel.elm-bgcexp (which is currently consistently failing in the e3sm_integration_next_intel nightly build)

e3sm_developer:
  pm-cpu (gnu): 75 PASS and 1 DIFF: ERP_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu

homme_integration:
  chrysalis (intel): PASS
  pm-cpu (gnu): builds PASS, stuck in Q for run, so I cancelled it. pm-cpu is not tested in nightlies anyways

eamxx testing (from eamxx repo, with a few additional commits for eamxx):
  v1 (CIME):
    chrysalis (intel): 10 PASS (all scream v1), 5 DIFF (all scream v0)
    pm-cpu (gnu): 3 PASS, 1 DIFF
    frontier: PEND
    ascent: no longer part of eamxx nightlies
    pm-gpu (gnugpu): 7 PASS, 5 DIFF. All DIFF are in debug mode, while all non-debug builds pass. I'm trying to understand what's the catch.
  standalone:
    mappy (gnu): all PASS
    weaver (gnu+cuda): all PASS

[BFB]

jgfouca · 2024-04-29T17:57:28Z

Merged to next.

bartgol · 2024-04-29T21:53:01Z

Update: we reverted the merge to next, since it will likely conflict with #6226. We will resume integration of this PR once that one is merged.

jgfouca · 2024-04-29T21:56:29Z

Reverted off of next.

jgfouca · 2024-05-02T17:39:07Z

Merged to next

bartgol · 2024-05-10T02:41:25Z

As of May 9th, there are a bunch of fails on CDash for next. Excluding the I and G cases, which should not depend on ekat/kokkos, we have the builds listed below. As I go through the builds, I'll add an explanation of the fails next to them, and if they are not this PR's fault, I'll check them out.

pm-cpu, e3sm_integration_next_intel:

SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF2.pm-cpu_intel: build FAIL, but across builds and also in master.
SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-wcprod: DIFF fail with File 'xyz' had no original counterpart in '<CASE>/run' with suffix ''. next is not generating eam.h5 output stream. Not this PR's fault.
SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.pm-cpu_intel.allactive-wcprodssp: next is not generating eam.h5 and eam.h6 output streams. Not this PR's fault.

chrysalis, e3sm_integration_next_intel:

SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.chrysalis_intel.allactive-wcprodssp: fails in master as well
SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF2.chrysalis_intel: fails in master as well

pm-cpu, e3sm_prod_next_intel:

SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.pm-cpu_intel.allactive-wcprodssp: next is not generating eam.h5 and eam.h6 output streams. Not this PR's fault.
SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP585.pm-cpu_intel.allactive-wcprodssp: FAIL due to problem retrieving input data. Not this PR's fault.

compy, e3sm_prod_next_intel:

SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.compy_intel.allactive-wcprodssp: FAIL due to problem retrieving input data. Not this PR's fault.
SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP585.compy_intel.allactive-wcprodssp: next is not generating eam.h5 and eam.h6 output streams. Not this PR's fault.

mappy, e3sm_developer_next_gnu:

SMS_D_Ln5.ne4pg2_oQU480.F2010.mappy_gnu: I get a segfault in both next and master
SMS_R_Ld5.ne4_ne4.FSCM-ARM97.mappy_gnu.eam-scm: I get same DIFF in next and master

anvil, e3sm_prod_next_intel: all these jobs seem to hit some batch scheduler issue. They either get canceled while running, or they are submitted but never produce any log in RUNDIR. It has been like this for a few days. I'm thinking it has nothing to do with this PR.

SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.F20TR.anvil_intel.eam-wcprod_F20TR
SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850-1pctCO2.anvil_intel.allactive-wcprod_1850_1pctCO2
SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850-4xCO2.anvil_intel.allactive-wcprod_1850_4xCO2
SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.anvil_intel.allactive-wcprod_1850
SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.anvil_intel.allactive-wcprodssp
SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP585.anvil_intel.allactive-wcprodssp
SMS_Ld1_PS.northamericax4v1pg2_WC14to60E2r3.WCYCL1850.anvil_intel.allactive-wcprodrrm_1850
SMS_Ln5.ne30pg2_r05_IcoswISC30E3r5.F2010.anvil_intel.eam-wcprod_F2010

bebop, e3sm_extra_coverage_next_intel:

pm-cpu, e3sm_superbfb_next_intel:

PET_Ld3_D.ne30pg2_EC30to60E2r2.WCYCL1850.pm-cpu_intel.pemod-omp2: now PASSes

bartgol · 2024-05-15T00:34:14Z

@jgfouca @rljacob I went through the yellow boxes of the MustPass and MustPass_wBaseline builds on cdash. I only checked F cases, since from what I understand CRYO/G/I cases are not using active atm, so they are not building kokkos.

For all failures I found a reason that seems to be unrelated to this PR. The only builds I can't deem as "ok" (at least from the point of view of merging this PR) are the bebop builds, since we need the new modules PR to go in for Kokkos 4.2 to be happy.

I am thinking that we could merge this PR as is, since the passes with Intel on other platforms make me confident we won't have many surprises once the bebop modules PR goes in (but I will of course keep an eye out, and jump in if F cases still fail due to kokkos shenanigans once that PR goes in).

What are your thoughts?

rljacob · 2024-05-15T01:01:45Z

Yes, it's fine to merge this without waiting for the bebop fixes.

…ASYNC=OFF' into next (PR #6423)

After #6101, which brings in Kokkos 4.2, a test like ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-gpu_gnugpu hits a runtime error like:

0: (GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 97

Unfortunately, the tests hitting this error also hang. A fix that seems to work is to add this build flag: -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF

The fix is merged in another PR for pm-gpu, so this PR just makes the same change to muller-gpu.

Fixes #6422

bartgol added the HOMME, HOMME standalone (issues with the standalone HOMME code that don't impact E3SM), and kokkos labels Dec 5, 2023

bartgol force-pushed the bartgol/e3sm/kokkos-4.2 branch from 0d9dad8 to 76d3527 Compare December 11, 2023 16:55

rljacob assigned jgfouca Dec 14, 2023

rljacob requested a review from ambrad December 14, 2023 21:36

bartgol force-pushed the bartgol/e3sm/kokkos-4.2 branch from 76d3527 to 524527d Compare January 12, 2024 21:34

rljacob mentioned this pull request Jan 22, 2024

Downstream merge of scream/eamxx fork into E3SM #6153

Merged

bartgol force-pushed the bartgol/e3sm/kokkos-4.2 branch from 524527d to 903e475 Compare January 23, 2024 00:15

bartgol force-pushed the bartgol/e3sm/kokkos-4.2 branch from 903e475 to 92d62e5 Compare January 30, 2024 18:58

bartgol force-pushed the bartgol/e3sm/kokkos-4.2 branch from 92d62e5 to a2a5813 Compare February 7, 2024 20:23

bartgol force-pushed the bartgol/e3sm/kokkos-4.2 branch from a2a5813 to 62bec2a Compare February 26, 2024 21:31

bartgol force-pushed the bartgol/e3sm/kokkos-4.2 branch from 62bec2a to 77d3108 Compare March 7, 2024 04:27

ambrad approved these changes Apr 16, 2024

bartgol force-pushed the bartgol/e3sm/kokkos-4.2 branch 2 times, most recently from 8794f8b to 3bc66c1 Compare April 25, 2024 18:32

bartgol added 5 commits April 25, 2024 12:32

Update EKAT submodule, to get Kokkos 4.2

62ba0f4

HOMME: remove pointless lines in compose

bec0414

Besides doing nothing, InitArguments is deprecated in Kokkos 4.0

HOMME: remove volatile qualifier in custom reducer join method

980ef03

Kokkos 4.0 no longer allows use of volatile in this context

HOMME: add missing iostream includes

dc5cb8f

HOMME: fix Kokkos-related compilation error

d7ae678

Some of the exec spaces static methods are no longer static

bartgol added 7 commits April 25, 2024 12:32

HOMME: add missing cmakedefine to preqx config.h.cmake.in file

dabc215

The KOKKOS_TARGET macro prevents Kokkos::initialize to be called twice in kokkos targets. Kokkos 4.2 no longer tolerates double initialization, so we must prevent it.

HOMME: add GllFvRemap cxx source files to preqx_kokkos sources

59ee13f

HOMME: use new CMake syntax from EKAT to build kokkos

a863acb

HOMME: do not set cxx standard. Kokkos will crap out anyways if c++17…

389cd9f

… is not supported

HOMME: fix scope of HIPTraits from kokkos

db1a5b0

EAMxx: add missing iostream include

3160ec7

Disable OpenMP in Kokkos for frontier-scream-gpu machine

74611c9
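
For readers unfamiliar with the "remove volatile qualifier in custom reducer join method" commit (980ef03) listed above, the following is a minimal sketch of what such a change looks like in spirit. It deliberately does not include Kokkos; `ToyMaxReducer` and `toy_reduce` are hypothetical names, and the actual HOMME reducer is not shown in this thread, so treat this as an illustration under those assumptions rather than the code that was changed.

```cpp
#include <limits>

// Hypothetical custom max reducer, written without Kokkos so it compiles
// standalone. It only mirrors the init/join shape of a custom reducer.
struct ToyMaxReducer {
  using value_type = double;

  // Older code typically declared the join arguments volatile, e.g.
  //   void join(volatile value_type& dest,
  //             const volatile value_type& src) const;
  // The form below simply drops the qualifier.
  void join(value_type& dest, const value_type& src) const {
    if (src > dest) dest = src;
  }

  // Identity element for max: the lowest representable value.
  void init(value_type& val) const {
    val = std::numeric_limits<value_type>::lowest();
  }
};

// Serial driver (an assumption for illustration) that uses the reducer
// the way a parallel runtime would: init once, then join contributions.
inline double toy_reduce(const double* data, int n) {
  ToyMaxReducer r;
  double result;
  r.init(result);
  for (int i = 0; i < n; ++i) r.join(result, data[i]);
  return result;
}
```

With an empty input, `toy_reduce` returns the identity produced by `init`, which is the behavior one would expect from a max reduction.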

bartgol force-pushed the bartgol/e3sm/kokkos-4.2 branch from 3bc66c1 to 74611c9 Compare April 25, 2024 18:33

jewatkins mentioned this pull request May 9, 2024

Update MALI tpls on pmcpu & add pmgpu build stephenprice/E3SM#1

Closed

jgfouca merged commit ed030dc into master May 15, 2024
22 checks passed

jgfouca deleted the bartgol/e3sm/kokkos-4.2 branch May 15, 2024 15:16

This was referenced May 16, 2024

(GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE on pm-gpu after kokkos 4.2 PR #6422

Closed

For muller-gpu, add -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF #6423

Merged
