PR #6696 introduced a new field (runningMeanRemovedIceRunoff) to the MPAS framework. However, this field appears to be inactive (invisible) on some PEs, leading to inconsistent field names passed to PIO_inq_varid calls, as shown below:
! Get variable ID
pio_ierr = PIO_inq_varid(handle % pio_file, trim(fieldname), new_fieldlist_node % fieldhandle % fieldid)
pio_ierr = PIO_inq_varid(handle % pio_file, trim(fieldname), new_fieldlist_node % fieldhandle % field_desc)
if (pio_ierr /= PIO_noerr) then
...
Below is debug output printed by SCORPIO for reproducer 3:
0: DEBUG, PIOc_inq_varid_impl, counter = 1458, ncid = 150, var name = newlyFormedIce, file = /global/cfs/cdirs/e3sm/inputdata/ice/mpas-seaice/IcoswISC30E3r5/mpassi.IcoswISC30E3r5.rstFromG-chrysalis.20231121.nc
1: DEBUG, PIOc_inq_varid_impl, counter = 1458, ncid = 150, var name = newlyFormedIce, file = /global/cfs/cdirs/e3sm/inputdata/ice/mpas-seaice/IcoswISC30E3r5/mpassi.IcoswISC30E3r5.rstFromG-chrysalis.20231121.nc
0: DEBUG, PIOc_inq_varid_impl, counter = 1459, ncid = 150, var name = runningMeanRemovedIceRunoff, file = /global/cfs/cdirs/e3sm/inputdata/ice/mpas-seaice/IcoswISC30E3r5/mpassi.IcoswISC30E3r5.rstFromG-chrysalis.20231121.nc
1: DEBUG, PIOc_inq_varid_impl, counter = 1459, ncid = 150, var name = xtime, file = /global/cfs/cdirs/e3sm/inputdata/ice/mpas-seaice/IcoswISC30E3r5/mpassi.IcoswISC30E3r5.rstFromG-chrysalis.20231121.nc
For counter = 1458, both ranks report the same variable name (newlyFormedIce).
However, for counter = 1459, rank 0 reports runningMeanRemovedIceRunoff, while rank 1 reports xtime. This suggests that runningMeanRemovedIceRunoff is only active on rank 0.
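The comparison above can be automated for runs with many ranks. The following is a minimal Python sketch, assuming the debug lines keep the `rank: DEBUG, PIOc_inq_varid_impl, counter = N, ncid = N, var name = X, file = Y` layout shown above. The helper name `find_mismatches` and the shortened file path `restart.nc` are hypothetical and are not part of SCORPIO or MPAS.

```python
import re
from collections import defaultdict

# Matches debug lines of the form shown in the issue, e.g.
#   0: DEBUG, PIOc_inq_varid_impl, counter = 1459, ncid = 150, var name = xtime, file = ...
LINE_RE = re.compile(
    r"^\s*(?P<rank>\d+):\s*DEBUG,\s*PIOc_inq_varid_impl,\s*"
    r"counter\s*=\s*(?P<counter>\d+),.*?var name\s*=\s*(?P<name>[^,]+),"
)


def find_mismatches(lines):
    """Return {counter: {rank: var_name}} for counters where ranks disagree."""
    by_counter = defaultdict(dict)
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip anything that is not a PIOc_inq_varid_impl debug line
        by_counter[int(m["counter"])][int(m["rank"])] = m["name"].strip()
    return {
        counter: names
        for counter, names in sorted(by_counter.items())
        if len(set(names.values())) > 1
    }


# Sample lines mirroring the output above; "restart.nc" is a placeholder path.
SAMPLE_LOG = [
    "0: DEBUG, PIOc_inq_varid_impl, counter = 1458, ncid = 150, var name = newlyFormedIce, file = restart.nc",
    "1: DEBUG, PIOc_inq_varid_impl, counter = 1458, ncid = 150, var name = newlyFormedIce, file = restart.nc",
    "0: DEBUG, PIOc_inq_varid_impl, counter = 1459, ncid = 150, var name = runningMeanRemovedIceRunoff, file = restart.nc",
    "1: DEBUG, PIOc_inq_varid_impl, counter = 1459, ncid = 150, var name = xtime, file = restart.nc",
]

mismatches = find_mismatches(SAMPLE_LOG)
print(mismatches)
# prints {1459: {0: 'runningMeanRemovedIceRunoff', 1: 'xtime'}}
```

The first counter reported is the point where the ranks' call sequences diverge. Later counters will usually disagree as well, because every subsequent call is shifted by one on the affected ranks.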
Problem Summary
The MPAS framework appears to manage fields like runningMeanRemovedIceRunoff incorrectly, leaving them inactive on some PEs. Because the ranks then pass different field names to PIO_inq_varid, the MPI_Bcast calls within SCORPIO can fall out of sync, resulting in critical errors such as failures in MPIR_CRAY_Bcast_Tree, or in hangs.
This issue is easily reproducible with the GNU compiler but difficult to reproduce with the Intel compiler.
Reproducers
The SCORPIO feature branch dqwu/mpi_bcast_wrapper is used to detect out-of-sync MPI_Bcast calls and print variable names passed to the PIO_inq_varid interface.
1. Restart run of ne120 F case with GNU compiler
   Machines: Perlmutter and Frontier
   Configuration: --compset F2010 --res ne120pg2_r05_oECv3
2. ERS Test with GNU compiler
   ERS_Ld3.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.<machine>_gnu.allactive-nlmaps
   Machine: Frontier (not reproducible on Perlmutter)
3. Initial run of ne30 WCYCL1850 case with GNU compiler
   Machine: Perlmutter (not reproducible on Frontier)
   Notes: Not always but frequently reproducible
For reproducer 1 run on Frontier (338 nodes, 21,600 processes), rank 256 and rank 512 passed runningMeanRemovedIceRunoff to PIO_inq_varid, while other ranks did not.
Fix issue with mpas-seaice restart_contents
A new field runningMeanRemovedIceRunoff was added in PR #6696 that
causes errors in some configurations on pm-cpu and frontier with the gnu
compiler. This moves the new field from a shared "meta-stream" to the
streams themselves because the attached package was not being inherited
correctly.
Fixes #6855
[BFB]