Intermittent runtime error in init: `ERROR: sum of areas on globe does not equal 4*pi` on pm-cpu. Solved: at least 2 suspect nodes on machine #6533

ndkeen · 2024-07-30T16:37:16Z

One of the standard nightly tests failed last night
ERP_Ln9.ne4pg2_oQU480.WCYCL20TRNS-MMF1.pm-cpu_intel.allactive-mmf_fixed_subcycle

I think this is another intermittent issue similar to #6469

It might happen on the order of 1 in 20 or 1 in 100?
The error happens in the first run of the ERP -- so I would assume we could reproduce with SMS.

 96:  gfr> Running with dynamics and physics on separate grids (physgrid).
 96: gfr> init nphys  2 check 0 boost_pg1 F
 96:  min/max hybm() coordinates:   0.000000000000000E+000  0.990993638939362
 96:  Running phys_grid_init()
 96:   ERROR: sum of areas on globe does not equal 4*pi
 96:   sum of areas =    12.5663706156013       1.242161928871610E-009
 96:  ERROR: phys_grid

ndkeen · 2024-08-08T20:28:20Z

Noting I see this error again with Aug28th test on next.
ERS_Ld3.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-nlmaps

With Aug 8th next, the following test hit same error:
SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.pm-cpu_intel.allactive-wcprodssp

With Aug 15th next, this test failed with same error:
PET_Ln9_PS.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-mach-pet

mzelinka · 2024-08-23T18:03:07Z

I too just hit this today. At the bottom of the atm.log file I see:

Running phys_grid_init()
INFO: Non-scalable action: Computing global coords in SE dycore.
INFO: Non-scalable action: Allocating global blocks in SE dycore.
INFO: Non-scalable action: Computing global area in SE dycore.
ERROR: sum of areas on globe does not equal 4*pi
sum of areas = 12.5663706103520 -4.007212339729449E-009
ERROR: phys_grid
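
For scale, both reported sums can be compared against 4*pi directly. The sketch below is in Python rather than the model's Fortran, and it assumes the `1.e-10` tolerance that appears in the `phys_grid.F90` snippet later in this thread; `TOLERANCE` and `area_deviation` are names made up for illustration.

```python
import math

TOLERANCE = 1.0e-10  # assumed: same threshold as the phys_grid.F90 snippet in this thread

def area_deviation(total_area):
    """Return (deviation from 4*pi, whether it is within TOLERANCE)."""
    deviation = total_area - 4.0 * math.pi
    return deviation, abs(deviation) <= TOLERANCE

# Sums copied from the two failing logs quoted above.
for total in (12.5663706156013, 12.5663706103520):
    deviation, ok = area_deviation(total)
    print(f"sum = {total:.13f}  deviation = {deviation: .3e}  within tolerance: {ok}")
```

Both deviations are on the order of 1e-9, more than ten times the assumed tolerance, whereas the healthy runs printed later in the thread sit near 7.5e-14.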

mzelinka · 2024-08-23T18:05:19Z

I was able to successfully re-run the exact same run script later.

ndkeen · 2024-09-05T17:03:48Z

SMS_Ln3.ne4pg2_oQU480.F2010-MMF2.pm-cpu_intel on Sep 4th next

ndkeen · 2024-09-07T17:07:26Z

SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-wcprod_1850 on sep 6th
jid 30232471 (includes nid004324)

whannah1 · 2024-09-09T15:26:11Z

I had a thought that we might address this issue by allowing the area and weights calculation to be attempted multiple times, so I rewrote the section for this in components/eam/src/physics/cam/phys_grid.F90 as follows:

    allocate( area_d(1:ngcols) )
    allocate( wght_d(1:ngcols) )
    area_d = 0.0_r8
    wght_d = 0.0_r8

    if (single_column .and. .not. scm_multcols) then
      area_d = 4.0_r8*pi
      wght_d = 4.0_r8*pi
    else
      success_area = .false.
      success_wght = .false.
      ntry = 4
      do i=0,ntry-1
         if (.not.success_area .or. .not.success_wght) then
            call get_horiz_grid_d(ngcols, area_d_out=area_d, wght_d_out=wght_d)
            if (.not.success_area) then
               if ( abs( sum(area_d) - 4.0_r8*pi ) <= 1.e-10_r8 ) success_area = .true.
            end if
            if (.not.success_wght) then
               if ( abs( sum(wght_d) - 4.0_r8*pi ) <= 1.e-10_r8 ) success_wght = .true.
            end if
         end if
      end do
      if (.not.success_area) then
         write(iulog,*) ' ERROR: sum of areas on globe does not equal 4*pi'
         write(iulog,*) ' sum of areas = ', sum(area_d), sum(area_d)-4.0_r8*pi
         call endrun('phys_grid')
      end if
      if (.not.success_wght) then
         write(iulog,*) ' ERROR: sum of integration weights on globe does not equal 4*pi'
         write(iulog,*) ' sum of weights = ', sum(wght_d), sum(wght_d)-4.0_r8*pi
         call endrun('phys_grid')
      end if
    endif

    do lcid=begchunk,endchunk
       do i=1,lchunks(lcid)%ncols
          lchunks(lcid)%area(i) = area_d(lchunks(lcid)%gcol(i))
          lchunks(lcid)%wght(i) = wght_d(lchunks(lcid)%gcol(i))
       enddo
    enddo

    deallocate( area_d )
    deallocate( wght_d )

I ran a few tests where I disabled the first check inside the ntry loop, set ntry=40, and printed the sum of the areas to see whether it varied at all over multiple MPI gathers. There was no detectable variation in the values, as shown below.

 WHDEBUG: i:           0  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           1  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           2  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           3  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           4  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           5  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           6  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           7  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           8  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           9  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          10  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          11  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          12  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          13  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          14  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          15  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          16  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          17  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          18  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          19  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          20  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          21  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          22  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          23  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          24  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          25  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          26  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          27  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          28  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          29  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          30  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          31  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          32  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          33  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          34  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          35  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          36  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          37  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          38  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          39  abs(sum(area_d)-4*pi):   7.4606987254810520E-014

Still, I think this sort of thing might be a good way to avoid this error if it is just a random hiccup that is messing up the MPI gather part of the process. I'm just not sure how to test it other than running the same test over and over and seeing if I can catch the values actually changing.

I'm happy to turn this into a quick PR if we think it's a good idea.
Any thoughts from @rljacob @ambrad @ndkeen @mt5555 @mahf708 ?

mt5555 · 2024-09-09T15:47:36Z

If this check fails, doesn't it mean there is some real data corruption somewhere, usually in the MPI system? I would worry that it's not safe to continue the run, even if you can get it to pass this check. The fact that it is failing here might just be because this is one of the first checks with an abort.

ndkeen · 2024-09-16T16:34:36Z

These 2 tests failed with this error using 9/15 checkout:

PEM_Ln9.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel
SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850-1pctCO2.pm-cpu_intel.allactive-wcprod_1850_1pctCO2

rljacob · 2024-09-16T16:49:06Z

Similar error to #6469

ndkeen · 2024-10-07T23:48:09Z

After the Intel compiler update, we had not seen any more failures of this sort until last night. With next of Oct 7th, I see 2:

ERS_Ld3.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-nlmaps
SMS_Ld2.ne30pg2_r05_IcoswISC30E3r5.BGCEXP_CNTL_CNPRDCTC_1850.pm-cpu_intel.elm-bgcexp

jobids: 31502419, 31504021

I posted more on the similar (and likely same root cause) issue Rob noted above, but so far, after 500+ cases on muller-cpu, I've not seen any fails like these. This is with newer slingshot software that will be coming to perlmutter soon.
However, on muller-cpu, with newer slingshot, I'm seeing that using FI_MR_CACHE_MONITOR=kdreg2 is required to avoid some hangs in our init. This setting is also available on pm-cpu now, so I will make a PR to try using this.

With testing so far, it does not seem to impact our simulations.

ndk/machinefiles/nersc-use-kdreg2

ndkeen · 2024-10-08T10:33:50Z

I think I found at least one "bad node". If I use the compute node nid004324, I can get the errors noted above.

At least for the following test cases:

ERS_Ld3.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-nlmaps.sus/run/e3sm.log.31574931.241008-035021: 648:   ERROR: sum of areas on globe does not equal 4*pi
ERS_Vmoab.ne4pg2_oQU480.WCYCL1850NS.pm-cpu_intel.sus/run/e3sm.log.31574955.241008-065103: 96:   ERROR: sum of areas on globe does not equal 4*pi
SMS_Ld2.ne30pg2_r05_IcoswISC30E3r5.BGCEXP_CNTL_CNPRDCTC_1850.pm-cpu_intel.elm-bgcexp.sus/run/e3sm.log.31574941.241008-035104: 648:   ERROR: sum of areas on globe does not equal 4*pi

jobids: 31574931, 31574955, 31574941

Another failure from before that I did not report:

f2010.piCtl.ne120pg2_r025_IcoswISC30E3r5.nofini.r0270.pb/run/e3sm.log.31247419.241001-202908: 9728:   ERROR: sum of areas on globe does not equal 4*pi

I know I looked for suspect nodes with this and/or the related issue, but did not find correlations. There may be more than one.

ndkeen · 2024-10-08T17:08:35Z

I pinged Mark Z to see the jobids of his failing tests: 29674471, 27517677
Where the first does include nid004324, but the second does not.
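
One way to look for suspect nodes is to count how often each node appears across the failing jobs. The sketch below is a minimal Python illustration, not the procedure actually used here. The node lists are hypothetical placeholders (only the presence or absence of nid004324 in the two job IDs above comes from this thread), and in practice they would have to be pulled from the scheduler's accounting records.

```python
from collections import Counter

def rank_suspect_nodes(failed_jobs):
    """Count, for each node, how many failing jobs ran on it (most frequent first)."""
    counts = Counter()
    for nodes in failed_jobs.values():
        counts.update(set(nodes))  # count each node at most once per job
    return counts.most_common()

# Hypothetical node lists: "nid_a", "nid_b", "nid_c" are placeholders, not real nodes.
failed_jobs = {
    "29674471": ["nid004324", "nid_a"],
    "27517677": ["nid_b", "nid_c"],
}

for node, hits in rank_suspect_nodes(failed_jobs):
    print(node, hits)
```

A node that shows up in most failing jobs but rarely in passing ones is a candidate. With more than one bad node, as suspected here, no single node will cover every failure.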

ndkeen · 2024-10-16T18:40:47Z

More updates on the other issue. To avoid these 2 suspect nodes, you can submit with `case.submit -a="-x nid004324,nid006855"`
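
A small helper can keep the exclusion list in one place. This is a sketch in Python with hypothetical names (`SUSPECT_NODES`, `exclude_arg`); it only builds the `-x` string shown above and does not call `case.submit` itself.

```python
SUSPECT_NODES = ["nid004324", "nid006855"]  # the two nodes named in this thread

def exclude_arg(nodes):
    """Build the batch argument that excludes the given nodes (empty string if none)."""
    unique = sorted(set(nodes))
    return "-x " + ",".join(unique) if unique else ""

print(f'case.submit -a="{exclude_arg(SUSPECT_NODES)}"')
```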

ndkeen · 2024-12-12T03:31:05Z

Just an update on this issue -- I've been working with NERSC to debug this issue in general. We have nid004324 roped off, but the other is still in the wild. I have a test that (slowly) goes thru each compute node on system and tries a super simple ne4 test. So far, I've tested 2383 unique nodes that are "ok".
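
The node sweep described above might be organized roughly as follows. This Python sketch is an assumption about the bookkeeping only: `next_untested`, the placeholder node names, and the use of Slurm's `-w` (`--nodelist`) option to pin a job to one node are illustrative, and it prints the submit command rather than running anything.

```python
def next_untested(all_nodes, tested_ok, excluded):
    """Return the nodes still to be checked, skipping known-bad and already-passed ones."""
    done = set(tested_ok) | set(excluded)
    return [node for node in all_nodes if node not in done]

# Placeholder inventory; a real sweep would read the node list from the scheduler.
all_nodes = ["nid004324", "nid_a", "nid_b", "nid_c"]
tested_ok = {"nid_a"}
excluded = {"nid004324"}  # already roped off

for node in next_untested(all_nodes, tested_ok, excluded):
    # Assumed: -w pins the job to a single node, mirroring the -x usage in this thread.
    print(f'case.submit -a="-w {node}"')
```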

ndkeen changed the title ~~ERROR: sum of areas on globe does not equal 4*pi with ERP_Ln9.ne4pg2_oQU480.WCYCL20TRNS-MMF1.pm-cpu_intel.allactive-mmf_fixed_subcycle~~ ERROR: sum of areas on globe does not equal 4*pi is encountered intermittently on pm-cpu intel Sep 5, 2024

ndkeen changed the title ~~ERROR: sum of areas on globe does not equal 4*pi is encountered intermittently on pm-cpu intel~~ Intermittent runtime error in init: ERROR: sum of areas on globe does not equal 4*pi with pm-cpu_intel Sep 16, 2024

ndkeen added the labels pm-cpu (Perlmutter at NERSC, CPU-only nodes) and intel (Intel compilers) Oct 8, 2024

ndkeen changed the title ~~Intermittent runtime error in init: ERROR: sum of areas on globe does not equal 4*pi with pm-cpu_intel~~ Intermittent runtime error in init: ERROR: sum of areas on globe does not equal 4*pi on pm-cpu. Solved: at least 2 suspect nodes on machine Oct 16, 2024
