
add eamxx tests to e3sm suites #6640

Closed (wants to merge 2 commits)

Conversation
Conversation

mahf708 (Contributor) commented Sep 23, 2024

Adds EAMxx tests to e3sm_atm_developer, e3sm_atm_integration, and e3sm_atm_prod.

@mahf708 mahf708 added Testing Anything related to unit/system tests EAMxx PRs focused on capabilities for EAMxx labels Sep 23, 2024
github-actions bot commented Sep 23, 2024

PR Preview Action v1.4.8
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6640/
on branch gh-pages at 2024-09-30 20:06 UTC

@mahf708 mahf708 requested review from rljacob and jgfouca September 24, 2024 00:33
@rljacob rljacob self-assigned this Sep 24, 2024
Review thread on cime_config/tests.py (outdated, resolved)
@mahf708 mahf708 requested a review from jgfouca September 24, 2024 15:55
@mahf708 mahf708 force-pushed the mahf708/ig/eamxx-tests-suites branch from 9d1f604 to 956af5d Compare September 30, 2024 20:04
rljacob (Member) commented Oct 2, 2024

@mahf708 is this ready?

mahf708 (Contributor, Author) commented Oct 2, 2024

> @mahf708 is this ready?

Yes. We may run into issues once this is on next, but as far as I am concerned these tests should pass on all our machines. I will debug if they fail.

rljacob (Member) commented Oct 3, 2024

Question: should we add some/all of the tests that are currently in the various "e3sm_scream" suites to our nightly testing?

mahf708 (Contributor, Author) commented Oct 3, 2024

> Question: should we add some/all of the tests that are currently in the various "e3sm_scream" suites to our nightly testing?

Personally, that's my ultimate goal. I wanted to start with these tests because they are more "traditional" and designed piece by piece with your involvement (in PRs here in this repo). I am happy to add more tests to this PR or issue a separate PR later. I would like @jgfouca and @AaronDonahue to weigh in and see what they prefer :)

rljacob (Member) commented Oct 3, 2024

Adding more tests can wait, but I think you should create new suite names in preparation for that. The "atm" suites are really EAM suites. It shouldn't be necessary for an EAM developer to make EAMxx tests pass. (Or should it?) So maybe you should make new e3sm_eamxx_developer, e3sm_eamxx_integration, and e3sm_eamxx_prod suites, add the tests you've already added to those, and then include them in the corresponding e3sm full suites.
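The suite layout described here (component suites that are pulled into the full e3sm suites) can be sketched as a dictionary with an inheritance key, loosely modeled on how E3SM's cime_config/tests.py organizes suites. The suite names follow the comment above, but the test entries and the exact schema are illustrative assumptions, not the real file contents:

```python
# Hypothetical sketch of suite definitions with inheritance.
# Suite names follow the discussion; test lists are illustrative.
_TESTS = {
    "e3sm_eamxx_developer": {
        "tests": ("SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI",),
    },
    "e3sm_atm_developer": {
        # The full suite pulls in the component suite rather than
        # duplicating its test list.
        "inherit": ("e3sm_eamxx_developer",),
        "tests": ("ERP_Ld3.ne4pg2_oQU480.F2010",),
    },
}

def resolve_suite(name, suites=_TESTS):
    """Expand a suite's 'inherit' entries into a flat list of tests."""
    spec = suites[name]
    tests = []
    for parent in spec.get("inherit", ()):
        tests.extend(resolve_suite(parent, suites))
    tests.extend(spec.get("tests", ()))
    return tests
```

With this shape, adding an EAMxx test to e3sm_eamxx_developer automatically surfaces it in e3sm_atm_developer, which is the structure being proposed.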

mahf708 (Contributor, Author) commented Oct 3, 2024

> It shouldn't be necessary for an EAM developer to make EAMxx tests pass. (Or should it?)

Recall that COSP edits led to EAMxx breaking, so at least one EAMxx test should run under the atm developer suite. I also thought we were slowly moving "atm" from EAM to EAMxx, no?

rljacob (Member) commented Oct 3, 2024

Good point. If EAMxx depends on code in components/eam then we need an eamxx test. That transition will take years but we need to build up our eamxx testing soon. But new suites can also wait for another PR.

rljacob added a commit that referenced this pull request Oct 8, 2024
Adds EAMxx tests to e3sm_atm_developer, e3sm_atm_integration, and e3sm_atm_prod.
mahf708 (Contributor, Author) commented Oct 8, 2024

I looked at the errors and list them below. I think there's some setting HOMME isn't happy with on chrysalis (so it aborts). There's a fail in PIO on pm-cpu that looks like a fluke. There are fails about "too many MPI tasks" on pm-cpu that I have never seen...

Any advice on how to proceed? copying @ndkeen

DIFF:

FAIL:

SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.chrysalis_intel (run)

SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.chrysalis_intel

 0: PIO: WARNING:  The user buffer used to read the distributed array is not contiguous. A temporary contiguous buffer will be used to read the data, that can negatively impact the I/O performance. varid =           11 , file id =           60, (spio_darray.F90:163)
 0: PIO: WARNING:  The user buffer used to read the distributed array is not contiguous. A temporary contiguous buffer will be used to read the data, that can negatively impact the I/O performance. varid =           12 , file id =           60, (spio_darray.F90:163)
 0: bfbhash>              0 -5549066689258533191 (Hommexx)
29: e3sm.exe: /gpfs/fs1/home/e3smtest/jenkins/workspace/ACME_chrysalis_next/E3SM/components/homme/src/share/cxx/PpmRemap.hpp:536: lambda []()->auto::operator()()->auto: Assertion `fabs(m_pio(kv.ie, igp, jgp, NUM_PHYSICAL_LEV) - m_pin(kv.ie, igp, jgp, NUM_PHYSICAL_LEV)) < 1.0' failed.
24: e3sm.exe: /gpfs/fs1/home/e3smtest/jenkins/workspace/ACME_chrysalis_next/E3SM/components/homme/src/share/cxx/PpmRemap.hpp:536: lambda []()->auto::operator()()->auto: Assertion `fabs(m_pio(kv.ie, igp, jgp, NUM_PHYSICAL_LEV) - m_pin(kv.ie, igp, jgp, NUM_PHYSICAL_LEV)) < 1.0' failed.
 9: e3sm.exe: /gpfs/fs1/home/e3smtest/jenkins/workspace/ACME_chrysalis_next/E3SM/components/homme/src/share/cxx/PpmRemap.hpp:536: lambda []()->auto::operator()()->auto: Assertion `fabs(m_pio(kv.ie, igp, jgp, NUM_PHYSICAL_LEV) - m_pin(kv.ie, igp, jgp, NUM_PHYSICAL_LEV)) < 1.0' failed.
10: e3sm.exe: /gpfs/fs1/home/e3smtest/jenkins/workspace/ACME_chrysalis_next/E3SM/components/homme/src/share/cxx/PpmRemap.hpp:536: lambda []()->auto::operator()()->auto: Assertion `fabs(m_pio(kv.ie, igp, jgp, NUM_PHYSICAL_LEV) - m_pin(kv.ie, igp, jgp, NUM_PHYSICAL_LEV)) < 1.0' failed.
62: e3sm.exe: /gpfs/fs1/home/e3smtest/jenkins/workspace/ACME_chrysalis_next/E3SM/components/homme/src/share/cxx/PpmRemap.hpp:536: lambda []()->auto::operator()()->auto: Assertion `fabs(m_pio(kv.ie, igp, jgp, NUM_PHYSICAL_LEV) - m_pin(kv.ie, igp, jgp, NUM_PHYSICAL_LEV)) < 1.0' failed.
62: e3sm.exe: /gpfs/fs1/home/e3smtest/jenkins/workspace/ACME_chrysalis_next/E3SM/components/homme/src/share/cxx/PpmRemap.hpp:536: lambda []()->auto::operator()()->auto: Assertion `fabs(m_pio(kv.ie, igp, jgp, NUM_PHYSICAL_LEV) - m_pin(kv.ie, igp, jgp, NUM_PHYSICAL_LEV)) < 1.0' failed.
62: [chr-0506:2178465:0:2178550] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))

ERS_Ld5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.chrysalis_intel.eamxx-prod (run)

ERS_Ld5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.chrysalis_intel.eamxx-prod

 0: bfbhash>              0 -5549066689258533027 (Hommexx)
41: terminate called after throwing an instance of 'std::logic_error'
41:   what():  /gpfs/fs1/home/e3smtest/jenkins/workspace/ACME_chrysalis_next/E3SM/components/homme/src/share/compose/compose_slmm_islmpi_adp.cpp:56: The condition:
41: true
41: led to the exception
41: Departure point is outside of halo:
41:   nearest point permitted: 1

SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.pm-cpu_nvidia (build)

SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.pm-cpu_nvidia

NVFORTRAN-S-0081-Illegal selector - KIND parameter has unknown value for data type  (/global/u2/e/e3smtest/jenkins/workspace/ACME_Perlmutter_next_nvidia/E3SM/components/homme/src/share/physical_constants.F90: 57)
NVFORTRAN-S-0081-Illegal selector - KIND parameter has unknown value for data type  (/global/u2/e/e3smtest/jenkins/workspace/ACME_Perlmutter_next_nvidia/E3SM/components/homme/src/share/physical_constants.F90: 57)
  0 inform,   0 warnings,   2 severes, 0 fatal for physical_constants
Target CMakeFiles/theta-l_kokkos_4_72_10.dir/__/__/__/homme/src/share/physical_constants.F90.o built in 0.073516 seconds
gmake[2]: *** [eamxx/src/mct_coupling/CMakeFiles/theta-l_kokkos_4_72_10.dir/build.make:676: eamxx/src/mct_coupling/CMakeFiles/theta-l_kokkos_4_72_10.dir/__/__/__/homme/src/share/physical_constants.F90.o] Error 2
gmake[2]: *** Waiting for unfinished jobs....

SMS_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu_intel.eamxx-prod (run)

SMS_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu_intel.eamxx-prod

 0: CalcWorkPerBlock: Total blocks:        270 Ice blocks:        270 IceFree blocks:          0 Land blocks:          0
  0: bfbhash>              0 -6271907791918625831 (Hommexx)
  0: MPIIO WARNING: DVS stripe width of 2 was requested but DVS set it to 1
  0: See MPICH_MPIIO_DVS_MAXNODES in the intro_mpi man page.
  0: bfbhash>             18 -8415750526090314275 (Hommexx)
  0: bfbhash>             36 -6188144623516052216 (Hommexx)
  0: PIO: FATAL ERROR: Aborting... An error occured, Writing variable SeaLevelPressure, varid=2, (total number of variables = 36) to file SMS_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu_intel.eamxx-prod.C.JNextProd20241007_205257.scream.6hourlyINST_native.h.INSTANT.nhours_x6.1994-10-01-00000.nc (ncid=29) using serial I/O failed.. NetCDF: Numeric conversion not representable (err=-60). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/global/u2/e/e3smtest/jenkins/workspace/ACME_Perlmutter_Prod/E3SM/externals/scorpio/src/clib/pio_darray_int.c: 984)
  0: Obtained 10 stack frames.
  0: /pscratch/sd/e/e3smtest/e3sm_scratch/pm-cpu/SMS_Ld1.ne30pg2_ne30pg2.F2010-SCREAMv1.pm-cpu_intel.eamxx-prod.C.JNextProd20241007_205257/bld/e3sm.exe() [0x26386bd]

ERS_Ld5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.pm-cpu_intel.eamxx-prod + SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.pm-cpu_gnu (run)

ERS_Ld5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.pm-cpu_intel.eamxx-prod

SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.pm-cpu_gnu

  0:  number of MPI processes per node: min,max=         128         128
  2:            3  ABORTING WITH ERROR:
  2:  Error: too many MPI tasks. set dyn_npes <= nelem
  2: MPICH ERROR [Rank 2] [job id 31567733.0] [Mon Oct  7 22:38:17 2024] [nid005813] - Abort(32766) (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 32766) - process 2
  2:
  2: aborting job:

rljacob (Member) commented Oct 8, 2024

Maybe @oksanaguba can help with the HOMME errors on chrysalis.

ambrad (Member) commented Oct 9, 2024

I propose we wait to bring in this PR until the SCREAM and E3SM repos are unified. The problem right now is the EAMxx code in E3SM is out of date w.r.t. the SCREAM repo. Thus, it's unclear what is being tested. In addition, it's possible fixes would require coordination between repos.

If these specific tests should be exercised immediately, I recommend adding them to the SCREAM-repo nightlies so they have a better chance of running out of the box when the repos get unified.

ndkeen (Contributor) commented Oct 9, 2024

Just quickly: the fail with too many MPI procs is a known issue with HOMME in SCREAM. There is code that refuses to proceed if the number of MPI tasks is larger than the number of elements (nelem).
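The guard described above can be sketched as a simple precondition check. The names dyn_npes and nelem follow the error message in the log; the real check lives in HOMME's source, not here, and an ne4 cubed-sphere grid has 6 * 4^2 = 96 elements, which is why a 128-task-per-node pm-cpu layout trips it:

```python
# Illustrative sketch of HOMME's decomposition guard: abort when the
# dynamics gets more MPI tasks than there are spectral elements, since
# each task needs at least one element to own.
def check_dyn_decomposition(dyn_npes, nelem):
    """Raise if the dynamics decomposition has more tasks than elements."""
    if dyn_npes > nelem:
        raise RuntimeError("Error: too many MPI tasks. set dyn_npes <= nelem")
    return True

def elements_on_cubed_sphere(ne):
    """Number of spectral elements on a cubed-sphere grid of resolution ne."""
    return 6 * ne * ne
```

For the failing ne4pg2 tests, elements_on_cubed_sphere(4) gives 96, so running with 128 tasks on the dynamics grid triggers exactly this abort.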

oksanaguba (Contributor) commented

I agree with Andrew that it would be better to wait till the repos are merged together.

rljacob added a commit that referenced this pull request Oct 10, 2024
rljacob (Member) commented Oct 10, 2024

Not actually reverted after a force push. See how it does with #6675

rljacob added a commit that referenced this pull request Oct 10, 2024
mahf708 (Contributor, Author) commented Oct 11, 2024

Thanks everyone for your patience and advice. Closing this PR for now, and I will work on a different PR in the future if needed. (I will make a separate PR for edits related to the GitHub Actions workflows.)

@mahf708 mahf708 closed this Oct 11, 2024
@mahf708 mahf708 deleted the mahf708/ig/eamxx-tests-suites branch October 11, 2024 01:49