MPAS framework: runningMeanRemovedIceRunoff field inactive on some PEs #6855

Closed
dqwu opened this issue Dec 17, 2024 · 2 comments · Fixed by #6857

dqwu commented Dec 17, 2024

PR #6696 introduced a new field (runningMeanRemovedIceRunoff) to the MPAS framework. However, this field appears to be inactive (invisible) on some PEs, leading to inconsistent field names passed to PIO_inq_varid calls, as shown below:

! Get variable ID
pio_ierr = PIO_inq_varid(handle % pio_file, trim(fieldname), new_fieldlist_node % fieldhandle % fieldid)
pio_ierr = PIO_inq_varid(handle % pio_file, trim(fieldname), new_fieldlist_node % fieldhandle % field_desc)
if (pio_ierr /= PIO_noerr) then
...

Below is debug output printed by SCORPIO for reproducer 3:

0: DEBUG, PIOc_inq_varid_impl, counter = 1458, ncid = 150, var name = newlyFormedIce, file = /global/cfs/cdirs/e3sm/inputdata/ice/mpas-seaice/IcoswISC30E3r5/mpassi.IcoswISC30E3r5.rstFromG-chrysalis.20231121.nc
1: DEBUG, PIOc_inq_varid_impl, counter = 1458, ncid = 150, var name = newlyFormedIce, file = /global/cfs/cdirs/e3sm/inputdata/ice/mpas-seaice/IcoswISC30E3r5/mpassi.IcoswISC30E3r5.rstFromG-chrysalis.20231121.nc
0: DEBUG, PIOc_inq_varid_impl, counter = 1459, ncid = 150, var name = runningMeanRemovedIceRunoff, file = /global/cfs/cdirs/e3sm/inputdata/ice/mpas-seaice/IcoswISC30E3r5/mpassi.IcoswISC30E3r5.rstFromG-chrysalis.20231121.nc
1: DEBUG, PIOc_inq_varid_impl, counter = 1459, ncid = 150, var name = xtime, file = /global/cfs/cdirs/e3sm/inputdata/ice/mpas-seaice/IcoswISC30E3r5/mpassi.IcoswISC30E3r5.rstFromG-chrysalis.20231121.nc

For counter = 1458, both ranks report the same variable name (newlyFormedIce).

However, for counter = 1459, rank 0 reports runningMeanRemovedIceRunoff, while rank 1 reports xtime. This suggests that runningMeanRemovedIceRunoff is only active on rank 0.
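
The comparison above can be automated over a full log. The following is a minimal sketch that assumes every debug line has exactly the format shown in the excerpt; the regular expression and the function name `find_mismatches` are hypothetical helpers and are not part of SCORPIO or the `dqwu/mpi_bcast_wrapper` branch.

```python
import re
from collections import defaultdict

# Assumed line format, copied from the excerpt above:
# "<rank>: DEBUG, PIOc_inq_varid_impl, counter = N, ncid = M, var name = NAME, file = PATH"
LINE_RE = re.compile(
    r"^(?P<rank>\d+): DEBUG, PIOc_inq_varid_impl, "
    r"counter = (?P<counter>\d+), ncid = \d+, "
    r"var name = (?P<name>[^,]+), file = "
)

def find_mismatches(lines):
    """Return {counter: {rank: var_name}} for counters where ranks disagree."""
    by_counter = defaultdict(dict)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            by_counter[int(m["counter"])][int(m["rank"])] = m["name"]
    return {
        counter: by_counter[counter]
        for counter in sorted(by_counter)
        if len(set(by_counter[counter].values())) > 1
    }

# The four lines from the excerpt, rebuilt from a template to keep them short.
FILE = ("/global/cfs/cdirs/e3sm/inputdata/ice/mpas-seaice/IcoswISC30E3r5/"
        "mpassi.IcoswISC30E3r5.rstFromG-chrysalis.20231121.nc")
TEMPLATE = ("{rank}: DEBUG, PIOc_inq_varid_impl, counter = {counter}, "
            "ncid = 150, var name = {name}, file = " + FILE)
sample = [
    TEMPLATE.format(rank=0, counter=1458, name="newlyFormedIce"),
    TEMPLATE.format(rank=1, counter=1458, name="newlyFormedIce"),
    TEMPLATE.format(rank=0, counter=1459, name="runningMeanRemovedIceRunoff"),
    TEMPLATE.format(rank=1, counter=1459, name="xtime"),
]

mismatches = find_mismatches(sample)
for counter, names in mismatches.items():
    print(counter, names)
# prints: 1459 {0: 'runningMeanRemovedIceRunoff', 1: 'xtime'}
```

On a real run, the lines would come from the job's log file rather than from the `sample` list.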

Problem Summary

It appears that the MPAS framework may incorrectly manage fields like runningMeanRemovedIceRunoff, making them inactive on some PEs. Ranks on which the field is active then issue a PIO_inq_varid call that the other ranks skip, which can cause out-of-sync MPI_Bcast calls within SCORPIO and lead to critical errors such as failures in MPIR_CRAY_Bcast_Tree or hangs.
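
To illustrate why one inactive field is enough to desynchronize the ranks, here is a toy model in plain Python with no MPI involved. It assumes each rank queries its active fields in stream order. The three field names come from the debug output above, but their adjacency in the list and the helper `first_divergence` are assumptions made for illustration only.

```python
def first_divergence(calls_a, calls_b):
    """Index of the first position where two call sequences differ, or None."""
    for i, (a, b) in enumerate(zip(calls_a, calls_b)):
        if a != b:
            return i
    if len(calls_a) != len(calls_b):
        return min(len(calls_a), len(calls_b))
    return None

# Assumed ordering; only the names themselves are taken from the debug output.
stream_fields = ["newlyFormedIce", "runningMeanRemovedIceRunoff", "xtime"]

# Rank 0 treats the field as active; rank 1 treats it as inactive and skips it.
rank0_calls = list(stream_fields)
rank1_calls = [f for f in stream_fields if f != "runningMeanRemovedIceRunoff"]

idx = first_divergence(rank0_calls, rank1_calls)
print(idx, rank0_calls[idx], rank1_calls[idx])
# prints: 1 runningMeanRemovedIceRunoff xtime

# Rank 0 ends up with one call that rank 1 never makes.
unmatched = len(rank0_calls) - len(rank1_calls)
print(unmatched)
# prints: 1
```

In this model every call after the divergence point is paired with a different variable name on the other rank, and the extra call on rank 0 has no partner at all, which is consistent with the broadcast failures and hangs described above.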

This issue is:

  • Easily reproducible with the GNU compiler
  • Difficult to reproduce with the Intel compiler

Reproducers

The SCORPIO feature branch dqwu/mpi_bcast_wrapper is used to detect out-of-sync MPI_Bcast calls and print variable names passed to the PIO_inq_varid interface.

  1. Restart run of ne120 F case with GNU compiler
    Machines: Perlmutter and Frontier
    Configuration: --compset F2010 --res ne120pg2_r05_oECv3

  2. ERS Test with GNU compiler
    ERS_Ld3.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.<machine>_gnu.allactive-nlmaps
    Machine: Frontier (not reproducible on Perlmutter)

  3. Initial run of ne30 WCYCL1850 case with GNU compiler
    Machine: Perlmutter (not reproducible on Frontier)
    Notes: Reproducible frequently, but not on every run

Steps:

git clone https://github.com/E3SM-Project/E3SM.git

cd E3SM
git submodule update --init --recursive

cd externals/scorpio
git fetch origin
git checkout dqwu/mpi_bcast_wrapper

cd ../../cime/scripts
./create_newcase --machine=pm-cpu --compiler=gnu --case WCYCL1850_ne30pg2_r05_IcoswISC30E3r5 --compset WCYCL1850 --res ne30pg2_r05_IcoswISC30E3r5 --walltime 00:05:00 --queue debug
cd WCYCL1850_ne30pg2_r05_IcoswISC30E3r5

cat <<EOF >> user_nl_eam
mfilt = 1
nhtfrq = -120
EOF

./xmlchange MAX_TASKS_PER_NODE=56
./xmlchange MAX_MPITASKS_PER_NODE=56
./xmlchange NTASKS=56
./xmlchange STOP_N=1
./xmlchange PIO_STRIDE=28

./case.setup

./case.build

./case.submit

  4. Restart run of ne30 WCYCL1850 case with GNU compiler
    Machine: Frontier

Steps:

...
./create_newcase --machine=frontier --compiler=gnu --case WCYCL1850_ne30pg2_r05_IcoswISC30E3r5 --compset WCYCL1850 --res ne30pg2_r05_IcoswISC30E3r5 --walltime 00:25:00
cd WCYCL1850_ne30pg2_r05_IcoswISC30E3r5

cat <<EOF >> user_nl_eam
mfilt = 1
nhtfrq = -120
EOF

./xmlchange MAX_TASKS_PER_NODE=56
./xmlchange MAX_MPITASKS_PER_NODE=56
./xmlchange NTASKS=56
./xmlchange STOP_N=1
./xmlchange PIO_STRIDE=28
./xmlchange RESUBMIT=1

./case.setup

./case.build

./case.submit

ndkeen commented Dec 17, 2024

Note this test also seems to capture the error: ERS_P512x1.ne30pg2_r05_IcoswISC30E3r5.F2010.pm-cpu_gnu


dqwu commented Dec 18, 2024

For reproducer 1 run on Frontier (338 nodes, 21,600 processes), rank 256 and rank 512 passed runningMeanRemovedIceRunoff to PIO_inq_varid, while other ranks did not.

jonbob added a commit that referenced this issue Dec 18, 2024
Fix issue with mpas-seaice restart_contents

A new field runningMeanRemovedIceRunoff was added in PR #6696 that
causes errors in some configurations on pm-cpu and frontier with the gnu
compiler. This moves the new field from a shared "meta-stream" to the
streams themselves because attached package wasn't getting inherited
correctly.

Fixes #6855
[BFB]
jonbob closed this as completed in aecc186 Dec 19, 2024