Update ekat to a version that has Kokkos 4.2 as submodule #6101

Merged: jgfouca merged 12 commits into master from bartgol/e3sm/kokkos-4.2 on May 15, 2024

Contributor

bartgol commented Dec 5, 2023

This PR will take time to integrate; I'm opening it so I can keep track of what I check.

  • e3sm_integration:
    • chrysalis (intel): all PASS
    • pm-cpu (intel) 127 PASS and 1 DIFF: SMS_Ld2.ne30pg2_r05_IcoswISC30E3r5.BGCEXP_CNTL_CNPECACNT_1850.pm-cpu_intel.elm-bgcexp (which is currently consistently failing in the e3sm_integration_next_intel nightly build)
  • e3sm_developer:
    • pm-cpu (gnu): 75 PASS and 1 DIFF: ERP_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu
  • homme_integration
    • chrysalis (intel): PASS
    • pm-cpu (gnu): builds PASS, stuck in Q for run, so I cancelled it. pm-cpu is not tested in nightlies anyways
  • eamxx testing (from eamxx repo, with a few additional commits for eamxx)
    • v1 (CIME)
      • chrysalis (intel): 10 PASS (all scream v1) 5 DIFF (all scream v0)
      • pm-cpu (gnu): 3 PASS, 1 DIFF
      • frontier PEND
      • ascent: no longer part of eamxx nightlies
      • pm-gpu (gnugpu): 7 PASS, 5 DIFF. All DIFF are in debug mode, while all non-debug builds pass. I'm trying to understand what's the catch.
    • standalone
      • mappy (gnu): all PASS
      • weaver (gnu+cuda): all PASS

@ambrad @oksanaguba can you think of any other machines/test suites I should run?

bartgol added the HOMME, HOMME standalone, and kokkos labels Dec 5, 2023

github-actions bot commented Dec 5, 2023

PR Preview Action v1.4.7
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6101/
on branch gh-pages at 2024-04-25 18:35 UTC

Member

ambrad commented Dec 6, 2023

Your list is quite comprehensive. Crusher seems to be in bad shape, so you might need to run the EAMxx tests on Frontier. The key EAMxx v1 tests are the ERS/P and PEM ones; there aren't baselines, so the only testing is for restart/PE-layout BFBness.

There is one additional set of tests you might consider running, to assure C++/F90 BFBness of the dycore: the Homme standalone tests on Summit or Ascent. I recently added code to make it easy to run these. You can use homme/cmake/machineFiles/summit-bfb.cmake on both Summit and Ascent, and that file does the config necessary to get easy BFB ctest'ing. I usually get an interactive node (bsub -Is -W 0:60 -nnodes 1 -P cli115 /bin/bash), start with ctest -R _ut just to make sure there's nothing obvious the unit tests see, then proceed with the full test suite.
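To make the "BFBness" idea above concrete, here is a minimal sketch of a bit-for-bit comparison between two in-memory fields. It is only an illustration: the function names `is_bfb` and `check_bfb_examples` are hypothetical, and the actual test suites compare model output files rather than raw arrays. The point is that BFB is stricter than numerical equality, since values that compare equal with `==` (such as 0.0 and -0.0) can still differ in their bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: true only if both fields have the same length
// and every byte of their storage is identical.
bool is_bfb(const std::vector<double>& a, const std::vector<double>& b) {
  if (a.size() != b.size()) return false;
  if (a.empty()) return true;  // avoid passing null pointers to memcmp
  return std::memcmp(a.data(), b.data(), a.size() * sizeof(double)) == 0;
}

// Small self-check that can be called from main().
void check_bfb_examples() {
  const std::vector<double> ref{1.0, 2.5, -3.0};
  assert(is_bfb(ref, ref));

  // 0.0 == -0.0 numerically, but the sign bit differs, so this is not BFB.
  const std::vector<double> pos{0.0};
  const std::vector<double> neg{-0.0};
  assert(pos[0] == neg[0]);
  assert(!is_bfb(pos, neg));
}
```

In a restart or PE-layout test, `a` would hold a field from the reference run and `b` the same field from the restarted or re-laid-out run.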

Contributor Author

bartgol commented Dec 6, 2023

Great, thanks! I thought about Frontier, but I was held back by the fact that we don't have baselines there. However, ERS/P tests can still be useful, even without baselines, so I will def run those (if crusher is still sick).

Member

ambrad commented Dec 6, 2023

> Great, thanks! I thought about Frontier, but I was held back by the fact that we don't have baselines there. However, ERS/P tests can still be useful, even without baselines, so I will def run those (if crusher is still sick).

Keep in mind the baselines issue is true for every platform except Chrysalis for SCREAMv1 tests.

Contributor Author

bartgol commented Jan 25, 2024

An update on this. I am hitting NaNs on Chrysalis, and I tracked them down to some packed scan operations. The core issue is that, when initializing the result variable of a scan op, Kokkos uses the default constructor of "ValueType". For ekat::Pack, that ctor initializes everything to NaN (to make uninitialized values easy to track). I'm discussing with the Kokkos folks why they don't use something like Kokkos::reduction_identity<ValueType>::sum(), which seems appropriate. Once I hear back from them, I'll know how to better tackle the issue (which may be "wait for Kokkos 4.3.00 or 4.2.01").
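The scan issue described above can be illustrated without Kokkos. What follows is a minimal sketch under stated assumptions: `ToyPack` is a hypothetical stand-in for ekat::Pack, `toy_inclusive_scan` is a hand-written serial scan rather than the Kokkos one, and `sum_identity` is a made-up helper playing the role that Kokkos::reduction_identity<ValueType>::sum() would play. It only shows why a NaN-filling default constructor poisons every partial sum when the accumulator is default-constructed.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical packed value type (not the real ekat::Pack): the default
// constructor fills every lane with NaN so uninitialized use is easy to spot.
template <int N>
struct ToyPack {
  std::array<double, N> lane;
  ToyPack() { lane.fill(std::numeric_limits<double>::quiet_NaN()); }
  explicit ToyPack(double v) { lane.fill(v); }
  ToyPack& operator+=(const ToyPack& rhs) {
    for (int i = 0; i < N; ++i) lane[i] += rhs.lane[i];
    return *this;
  }
};

// Hypothetical identity helper: the neutral element for a sum is zero.
template <int N>
ToyPack<N> sum_identity() { return ToyPack<N>(0.0); }

// Serial inclusive scan; the accumulator starts from `init`.
template <int N>
std::vector<ToyPack<N>> toy_inclusive_scan(const std::vector<ToyPack<N>>& in,
                                           ToyPack<N> init) {
  std::vector<ToyPack<N>> out;
  out.reserve(in.size());
  for (const auto& x : in) {
    init += x;
    out.push_back(init);
  }
  return out;
}

// True if any lane of any entry is NaN.
template <int N>
bool has_nan(const std::vector<ToyPack<N>>& v) {
  for (const auto& p : v)
    for (double x : p.lane)
      if (std::isnan(x)) return true;
  return false;
}

// Small self-check that can be called from main().
void check_scan_identity() {
  const std::vector<ToyPack<4>> in(3, ToyPack<4>(1.0));
  // Default-constructed accumulator: NaN + x stays NaN in every partial sum.
  assert(has_nan(toy_inclusive_scan(in, ToyPack<4>())));
  // Accumulator seeded with the sum identity: results are clean.
  assert(!has_nan(toy_inclusive_scan(in, sum_identity<4>())));
}
```

Seeding the accumulator with the identity keeps the NaN-by-default debugging aid in the value type while still giving correct scan results.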

bartgol force-pushed the bartgol/e3sm/kokkos-4.2 branch from 903e475 to 92d62e5 on January 30, 2024 18:58
bartgol force-pushed the bartgol/e3sm/kokkos-4.2 branch from 92d62e5 to a2a5813 on February 7, 2024 20:23
bartgol force-pushed the bartgol/e3sm/kokkos-4.2 branch from a2a5813 to 62bec2a on February 26, 2024 21:31
bartgol force-pushed the bartgol/e3sm/kokkos-4.2 branch from 62bec2a to 77d3108 on March 7, 2024 04:27
Contributor Author

bartgol commented Mar 14, 2024

For SCREAMv1 compset testing on Chrysalis, the failures all look the same. The stacktrace (see below) seems to point to some sort of error during MPI initialization, which, besides being completely out of our control, is also completely independent of Kokkos.

25: forrtl: error (65): floating invalid
25: Image              PC                Routine            Line        Source    
25: libpnetcdf.so.3.0  000015555171E68C  for__signal_handl     Unknown  Unknown
25: libpthread-2.28.s  00001555453EFCF0  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    00001555402AA57F  ucp_proto_perf_en     Unknown  Unknown
25: libucp.so.0.0.0    00001555402AAA50  ucp_proto_init_pa     Unknown  Unknown
25: libucp.so.0.0.0    00001555402AB8EC  ucp_proto_common_     Unknown  Unknown
25: libucp.so.0.0.0    00001555402B127C  ucp_proto_multi_i     Unknown  Unknown
25: libucp.so.0.0.0    00001555402E013A  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    00001555402B1CDB  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    00001555402B2BB2  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    00001555402B2DC4  ucp_proto_select_     Unknown  Unknown
25: libucp.so.0.0.0    00001555402B39A7  ucp_proto_select_     Unknown  Unknown
25: libucp.so.0.0.0    00001555402A30D8  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    00001555402A330E  ucp_worker_get_ep     Unknown  Unknown
25: libucp.so.0.0.0    0000155540309ADD  ucp_wireup_init_l     Unknown  Unknown
25: libucp.so.0.0.0    000015554028CF75  ucp_ep_create_to_     Unknown  Unknown
25: libucp.so.0.0.0    000015554028D714  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    000015554028DB8E  ucp_ep_create         Unknown  Unknown
25: libmpi.so.40.30.3  0000155545AA7607  mca_pml_ucx_add_p     Unknown  Unknown
25: libmpi.so.40.30.3  0000155545B0D723  ompi_mpi_init         Unknown  Unknown
25: libmpi.so.40.30.3  00001555458D004D  MPI_Init              Unknown  Unknown
25: libmpi_mpifh.so.4  0000155545E729D7  PMPI_Init_f08         Unknown  Unknown
25: e3sm.exe           0000000000437E05  cime_comp_mod_mp_         708  cime_comp_mod.F90
25: e3sm.exe           0000000000499955  MAIN__                     63  cime_driver.F90
25: e3sm.exe           0000000000437D22  Unknown               Unknown  Unknown
25: libc-2.28.so       0000155545052D85  __libc_start_main     Unknown  Unknown
25: e3sm.exe           0000000000437C2E  Unknown               Unknown  Unknown

But scream nightlies are also getting that error, and Rob mentioned an upgrade to the chrys drivers that is causing issues, with a fix being worked on by ANL folks. No need to sweat over the chrys fails (yet).

Member

rljacob commented Mar 21, 2024

@bartgol Chrysalis had some updates last week that may have caused the MPI fails. Please try your tests again.

Member

rljacob commented Apr 12, 2024

@bartgol how is this going?

Contributor Author

bartgol commented Apr 16, 2024

@rljacob I was out almost 2 weeks due to knee surgery. I am back now, and this is a priority on my todo list. I think I just need to check EAMxx testing on frontier, and then we can integrate. It's a pain to test so many testsuites manually, since by the time I figure out the fix for one DIFF/FAIL, some other build will fail due to master baselines being updated (forcing a rebase). So as soon as I confirm that eamxx on frontier is ok, I would like to merge to next, to start integration.

bartgol force-pushed the bartgol/e3sm/kokkos-4.2 branch 2 times, most recently from 8794f8b to 3bc66c1 on April 25, 2024 18:32
bartgol added 5 commits April 25, 2024 12:32
Besides doing nothing, InitArguments is deprecated in Kokkos 4.0
Kokkos 4.0 no longer allows use of volatile in this context
Some of the exec spaces static methods are no longer static
bartgol force-pushed the bartgol/e3sm/kokkos-4.2 branch from 3bc66c1 to 74611c9 on April 25, 2024 18:33
Contributor Author

bartgol commented Apr 25, 2024

@rljacob I think this branch is ready for integration. Can we pipeline it? I think there were 2 diffs in total, but keeping up with rebases was a pain, so I'd like to give it a shot with next testing...

Member

rljacob commented Apr 25, 2024

pipeline it? github says there's no conflicts.

Contributor Author

bartgol commented Apr 25, 2024

I mean, I don't know if next is open, and/or if other PRs were already scheduled for integration. I just want this to be put in line.

Pinging @jgfouca as well, since he's the assignee.

Contributor Author

bartgol commented Apr 25, 2024

Btw, @rljacob this PR includes the mod that is pipelined in eamxx via E3SM-Project/scream#2799. Would you like to do a similar PR in E3SM first, and then integrate this PR?

Member

rljacob commented Apr 26, 2024

No, it's OK to be in this PR.

Member

jgfouca commented Apr 29, 2024

Is this ready to merge to next?

Contributor Author

bartgol commented Apr 29, 2024

Jim, I think we can merge to next, yes.

jgfouca added a commit that referenced this pull request Apr 29, 2024
Update ekat to a version that has Kokkos 4.2 as submodule

[BFB]
Member

jgfouca commented Apr 29, 2024

Merged to next.

Contributor Author

bartgol commented Apr 29, 2024

Update: we reverted the merge to next, since it will likely conflict with #6226. We will resume integration of this PR once that one is merged.

Member

jgfouca commented Apr 29, 2024

Reverted off of next.

jgfouca added a commit that referenced this pull request May 2, 2024

[BFB]
Member

jgfouca commented May 2, 2024

Merged to next

Contributor Author

bartgol commented May 10, 2024

There are a bunch of failures on CDash for next as of May 9th. Excluding the I and G cases, which should not depend on ekat/kokkos, we have the builds listed below. As I go through the builds, I'll add an explanation of the fails next to them, and if they are not this PR's fault, I'll check them out.

pm-cpu, e3sm_integration_next_intel:

  • SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF2.pm-cpu_intel: build FAIL, but across builds and also in master.
  • SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-wcprod: DIFF fail with File 'xyz' had no original counterpart in '<CASE>/run' with suffix ''. next is not generating eam.h5 output stream. Not this PR's fault.
  • SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.pm-cpu_intel.allactive-wcprodssp: next is not generating eam.h5 and eam.h6 output streams. Not this PR's fault.

chrysalis, e3sm_integration_next_intel:

  • SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.chrysalis_intel.allactive-wcprodssp: fails in master as well
  • SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF2.chrysalis_intel: fails in master as well

pm-cpu, e3sm_prod_next_intel:

  • SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.pm-cpu_intel.allactive-wcprodssp: next is not generating eam.h5 and eam.h6 output streams. Not this PR's fault.
  • SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP585.pm-cpu_intel.allactive-wcprodssp: FAIL due to problem retrieving input data. Not this PR's fault.

compy, e3sm_prod_next_intel:

  • SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.compy_intel.allactive-wcprodssp: FAIL due to problem retrieving input data. Not this PR's fault.
  • SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP585.compy_intel.allactive-wcprodssp: next is not generating eam.h5 and eam.h6 output streams. Not this PR's fault.

mappy, e3sm_developer_next_gnu:

  • SMS_D_Ln5.ne4pg2_oQU480.F2010.mappy_gnu: I get a segfault in both next and master
  • SMS_R_Ld5.ne4_ne4.FSCM-ARM97.mappy_gnu.eam-scm: I get same DIFF in next and master

anvil, e3sm_prod_next_intel: all these jobs seem to hit some batch scheduler issue. They either get canceled while running, or they are submitted but never produce any log in RUNDIR. It has been like this for a few days. I'm thinking it has nothing to do with this PR.

  • SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.F20TR.anvil_intel.eam-wcprod_F20TR
  • SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850-1pctCO2.anvil_intel.allactive-wcprod_1850_1pctCO2
  • SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850-4xCO2.anvil_intel.allactive-wcprod_1850_4xCO2
  • SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.anvil_intel.allactive-wcprod_1850
  • SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.anvil_intel.allactive-wcprodssp
  • SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP585.anvil_intel.allactive-wcprodssp
  • SMS_Ld1_PS.northamericax4v1pg2_WC14to60E2r3.WCYCL1850.anvil_intel.allactive-wcprodrrm_1850
  • SMS_Ln5.ne30pg2_r05_IcoswISC30E3r5.F2010.anvil_intel.eam-wcprod_F2010

bebop, e3sm_extra_coverage_next_intel:

  • ERP_Ld3.ne30pg2_r05_IcoswISC30E3r5.F2010.bebop_intel.allactive-pioroot1
  • ERP_Ld3.ne4pg2_oQU480.F2010.bebop_intel.eam-condidiag_dcape
  • ERP_Ld3.ne4pg2_oQU480.F2010.bebop_intel.eam-condidiag_rhi
  • ERP_Lm3.ne4pg2_oQU480.F2010.bebop_intel
  • ERS_Ld31.ne4pg2_oQU480.F2010.bebop_intel
  • ERS_Ld5.ne30pg2_r05_IcoswISC30E3r5.F2010.bebop_intel.eam-implicit_stress
  • SMS_D_Ln5.ne30pg2_r05_IcoswISC30E3r5.F2010.bebop_intel
  • SMS_D_Ln5.ne45pg2_ne45pg2.FAQP.bebop_intel
  • SMS_D_Ln5.ne4pg2_oQU480.F2010.bebop_intel.eam-implicit_stress
  • SMS_Lm1.ne4pg2_oQU480.F2010.bebop_intel
  • SMS_Ly1.ne4pg2_oQU480.F2010.bebop_intel

pm-cpu, e3sm_superbfb_next_intel:

  • PET_Ld3_D.ne30pg2_EC30to60E2r2.WCYCL1850.pm-cpu_intel.pemod-omp2: now PASSes

Contributor Author

bartgol commented May 15, 2024

@jgfouca @rljacob I went through the yellow boxes of the MustPass and MustPass_wBaseline builds on cdash. I only checked F cases, since from what I understand CRYO/G/I cases are not using active atm, so they are not building kokkos.

For all failures I found a reason that seems to be unrelated to this PR. The only builds I can't deem as "ok" (at least from the point of view of merging this PR) are the bebop builds, since we need the new modules PR to go in for kokkos 4.2 to be happy.

I am thinking that we could merge this PR as is, since the passes with Intel on other platforms make me confident we won't have many surprises once the bebop modules PR goes in (but I will of course keep an eye out, and jump in if F cases still fail due to kokkos shenanigans once that PR goes in).

What are your thoughts?

Member

rljacob commented May 15, 2024

Yes, it's fine to merge this without waiting for the bebop fixes.

jgfouca merged commit ed030dc into master May 15, 2024
22 checks passed
jgfouca deleted the bartgol/e3sm/kokkos-4.2 branch May 15, 2024 15:16
ndkeen added a commit that referenced this pull request May 20, 2024
…ASYNC=OFF' into next (PR #6423)

After #6101 which brings in kokkos 4.2, we see runtime error with a test like:
ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-gpu_gnugpu

hits runtime error like:

0: (GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 97
unfortunately, the tests hitting this error are also hanging...

A fix that seems to work is to add this build flag:
-DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF

The fix is merged in another PR for pm-gpu, so this PR just makes same change to muller-gpu.

Fixes #6422
Labels: HOMME, HOMME standalone, kokkos