Intermittent runtime error in init: ERROR: sum of areas on globe does not equal 4*pi on pm-cpu. Solved: at least 2 suspect nodes on machine #6533

Open
ndkeen opened this issue Jul 30, 2024 · 14 comments
Labels
intel Intel compilers pm-cpu Perlmutter at NERSC (CPU-only nodes)

Comments

@ndkeen
Contributor

ndkeen commented Jul 30, 2024

One of the standard nightly tests failed last night
ERP_Ln9.ne4pg2_oQU480.WCYCL20TRNS-MMF1.pm-cpu_intel.allactive-mmf_fixed_subcycle

I think this is another intermittent issue similar to #6469

It might happen on the order of 1 in 20 to 1 in 100 runs?
The error happens in the first run of the ERP -- so I would assume we could reproduce with SMS.

 96:  gfr> Running with dynamics and physics on separate grids (physgrid).
 96: gfr> init nphys  2 check 0 boost_pg1 F
 96:  min/max hybm() coordinates:   0.000000000000000E+000  0.990993638939362
 96:  Running phys_grid_init()
 96:   ERROR: sum of areas on globe does not equal 4*pi
 96:   sum of areas =    12.5663706156013       1.242161928871610E-009
 96:  ERROR: phys_grid
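For readers unfamiliar with the check, here is a minimal Python sketch of what the log is reporting, assuming the 1.e-10 tolerance that appears in the phys_grid.F90 snippet quoted later in this thread. The function names are illustrative and are not part of E3SM.

```python
import math

# Tolerance taken from the phys_grid.F90 snippet quoted later in this thread.
TOL = 1.0e-10

def area_residual(total):
    """Difference between a global area sum and the unit-sphere area 4*pi."""
    return total - 4.0 * math.pi

def area_sum_ok(total, tol=TOL):
    """True when the area sum matches 4*pi to within tol."""
    return abs(area_residual(total)) <= tol

# "sum of areas" value copied from the failing log above.
reported = 12.5663706156013
print(f"residual = {area_residual(reported):+.3e}, ok = {area_sum_ok(reported)}")
```

The residual is on the order of 1e-9, roughly ten times the tolerance, so the run aborts in phys_grid_init().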

@ndkeen
Contributor Author

ndkeen commented Aug 8, 2024

Noting that I see this error again with the Aug 28th test on next.
ERS_Ld3.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-nlmaps

With Aug 8th next, the following test hit the same error:
SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.pm-cpu_intel.allactive-wcprodssp

With Aug 15th next, this test failed with the same error:
PET_Ln9_PS.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-mach-pet

@mzelinka

I too just hit this today. At the bottom of the atm.log file I see:

Running phys_grid_init()
INFO: Non-scalable action: Computing global coords in SE dycore.
INFO: Non-scalable action: Allocating global blocks in SE dycore.
INFO: Non-scalable action: Computing global area in SE dycore.
ERROR: sum of areas on globe does not equal 4*pi
sum of areas = 12.5663706103520 -4.007212339729449E-009
ERROR: phys_grid

@mzelinka

I was able to successfully re-run the exact same run script later.

@ndkeen ndkeen changed the title ERROR: sum of areas on globe does not equal 4*pi with ERP_Ln9.ne4pg2_oQU480.WCYCL20TRNS-MMF1.pm-cpu_intel.allactive-mmf_fixed_subcycle ERROR: sum of areas on globe does not equal 4*pi is encountered intermittently on pm-cpu intel Sep 5, 2024
@ndkeen
Contributor Author

ndkeen commented Sep 5, 2024

SMS_Ln3.ne4pg2_oQU480.F2010-MMF2.pm-cpu_intel on Sep 4th next

@ndkeen
Contributor Author

ndkeen commented Sep 7, 2024

SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-wcprod_1850 on Sep 6th
jid 30232471 (includes nid004324)

@whannah1
Contributor

whannah1 commented Sep 9, 2024

I had a thought that we might address this issue by allowing the area and weights calculation to be attempted multiple times, so I rewrote the section for this in components/eam/src/physics/cam/phys_grid.F90 as follows:

    allocate( area_d(1:ngcols) )
    allocate( wght_d(1:ngcols) )
    area_d = 0.0_r8
    wght_d = 0.0_r8

    if (single_column .and. .not. scm_multcols) then
      area_d = 4.0_r8*pi
      wght_d = 4.0_r8*pi
    else
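      ! Retry the global gather up to ntry times. Each flag latches to .true.
      ! the first time its sum matches 4*pi to within 1.e-10.
      success_area = .false.
      success_wght = .false.
      ntry = 4
      do i=0,ntry-1
         ! Skip the gather once both sums have passed.
         if (.not.success_area .or. .not.success_wght) then
            call get_horiz_grid_d(ngcols, area_d_out=area_d, wght_d_out=wght_d)
            ! A flag that passed on an earlier try is not re-checked here,
            ! even though this call overwrites both area_d and wght_d.
            if (.not.success_area) then
               if ( abs( sum(area_d) - 4.0_r8*pi ) <= 1.e-10_r8 ) success_area = .true.
            end if
            if (.not.success_wght) then
               if ( abs( sum(wght_d) - 4.0_r8*pi ) <= 1.e-10_r8 ) success_wght = .true.
            end if
         end if
      end do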
      if (.not.success_area) then
         write(iulog,*) ' ERROR: sum of areas on globe does not equal 4*pi'
         write(iulog,*) ' sum of areas = ', sum(area_d), sum(area_d)-4.0_r8*pi
         call endrun('phys_grid')
      end if
      if (.not.success_wght) then
         write(iulog,*) ' ERROR: sum of integration weights on globe does not equal 4*pi'
         write(iulog,*) ' sum of weights = ', sum(wght_d), sum(wght_d)-4.0_r8*pi
         call endrun('phys_grid')
      end if
    endif

    do lcid=begchunk,endchunk
       do i=1,lchunks(lcid)%ncols
          lchunks(lcid)%area(i) = area_d(lchunks(lcid)%gcol(i))
          lchunks(lcid)%wght(i) = wght_d(lchunks(lcid)%gcol(i))
       enddo
    enddo

    deallocate( area_d )
    deallocate( wght_d )

I ran a few tests where I disabled the first check of the ntry for loop, set ntry=40, and printed the sum of the areas to see if they varied at all over multiple MPI gathers - but there was no detectable variation in the values, as shown below.

 WHDEBUG: i:           0  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           1  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           2  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           3  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           4  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           5  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           6  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           7  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           8  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:           9  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          10  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          11  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          12  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          13  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          14  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          15  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          16  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          17  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          18  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          19  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          20  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          21  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          22  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          23  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          24  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          25  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          26  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          27  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          28  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          29  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          30  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          31  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          32  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          33  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          34  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          35  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          36  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          37  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          38  abs(sum(area_d)-4*pi):   7.4606987254810520E-014
 WHDEBUG: i:          39  abs(sum(area_d)-4*pi):   7.4606987254810520E-014

Still, I think this sort of thing might be a good way to avoid this error if it is just a random hiccup that is messing up the MPI gather part of the process. I'm just not sure how to test it, other than running the same test over and over to see if I can catch the values actually changing.

I'm happy to turn this into a quick PR if we think it's a good idea.
Any thoughts from @rljacob @ambrad @ndkeen @mt5555 @mahf708 ?
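The identical residual on all 40 tries in the WHDEBUG output above is what one would expect from ordinary floating-point round-off: summing the same values in the same order gives the same bits every time. A minimal Python sketch of that behavior follows. It assumes equal-area columns, which real grid cells are not, and uses 21600 columns (the ne30pg2 count: 30*30*6 elements with 4 physics cells each).

```python
import math

NCOLS = 21600   # physics columns on an ne30pg2 grid
TOL = 1.0e-10   # tolerance from the Fortran snippet above

# Assumption: every column gets the same area. Real grid cells differ in size.
areas = [4.0 * math.pi / NCOLS] * NCOLS

def residual(values):
    """Absolute difference between a left-to-right sum of values and 4*pi."""
    total = 0.0
    for v in values:
        total += v
    return abs(total - 4.0 * math.pi)

# Repeat the same sum 40 times, mirroring the ntry=40 experiment.
residuals = [residual(areas) for _ in range(40)]
print("all identical:", len(set(residuals)) == 1)
print("within tolerance:", residuals[0] <= TOL)
```

Round-off of this kind stays well below 1.e-10, so the failing residuals of order 1e-9 reported in this issue suggest that the gathered data itself differed on those runs.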

@mt5555
Contributor

mt5555 commented Sep 9, 2024

If this check fails, doesn't it mean there is some real data corruption somewhere, usually in the MPI system? I would worry that means it's not safe to continue the run, even if you can get it to pass this check. The fact that it is failing here might just be because this is one of the first checks with an abort.
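To make the concern concrete, here is a small Python simulation, offered as a sketch only. `make_flaky_gather` is a hypothetical stand-in for the MPI gather that corrupts one value on its first calls. The retry loop passes on a later attempt even though a corruption did occur, which is exactly the situation where continuing the run may be unsafe.

```python
import math

TOL = 1.0e-10
NCOLS = 384  # small illustrative column count (ne4pg2: 4*4*6 elements x 4 cells)

def make_flaky_gather(n_bad):
    """Hypothetical gather: the first n_bad calls return one corrupted entry."""
    calls = {"n": 0}
    clean = [4.0 * math.pi / NCOLS] * NCOLS

    def gather():
        calls["n"] += 1
        data = list(clean)
        if calls["n"] <= n_bad:
            data[0] += 1.0e-9  # simulated corruption of a single value
        return data

    return gather

def gather_with_retry(gather, ntry=4):
    """Return (data, failed_attempts); raise if no attempt passes the check."""
    failed = 0
    for _ in range(ntry):
        data = gather()
        if abs(math.fsum(data) - 4.0 * math.pi) <= TOL:
            return data, failed
        failed += 1
    raise RuntimeError("sum of areas on globe does not equal 4*pi")

data, failed = gather_with_retry(make_flaky_gather(n_bad=1))
print("check passed after", failed, "failed attempt(s)")
```

Returning the failed-attempt count at least lets the caller log that a retry was needed, rather than hiding the event entirely.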

@ndkeen ndkeen changed the title ERROR: sum of areas on globe does not equal 4*pi is encountered intermittently on pm-cpu intel Intermittent runtime error in init: ERROR: sum of areas on globe does not equal 4*pi with pm-cpu_intel Sep 16, 2024
@ndkeen
Contributor Author

ndkeen commented Sep 16, 2024

These 2 tests failed with this error using 9/15 checkout:

PEM_Ln9.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel
SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850-1pctCO2.pm-cpu_intel.allactive-wcprod_1850_1pctCO2

@rljacob
Member

rljacob commented Sep 16, 2024

Similar error to #6469

@ndkeen
Contributor Author

ndkeen commented Oct 7, 2024

After the Intel compiler update, we had not seen any more failures of this sort until last night. With next of Oct 7th, I see 2:

ERS_Ld3.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-nlmaps
SMS_Ld2.ne30pg2_r05_IcoswISC30E3r5.BGCEXP_CNTL_CNPRDCTC_1850.pm-cpu_intel.elm-bgcexp

jobids: 31502419, 31504021

I posted more on the similar (and likely same root cause) issue Rob noted above, but so far, after 500+ cases on muller-cpu, I've not seen any failures like these. This is with newer Slingshot software that will be coming to Perlmutter soon.
However, on muller-cpu, with the newer Slingshot software, I'm seeing that using FI_MR_CACHE_MONITOR=kdreg2 is required to avoid some hangs in our init. This setting is also available on pm-cpu now, so I will make a PR to try using this.

With testing so far, it does not seem to impact our simulations.

ndk/machinefiles/nersc-use-kdreg2

@ndkeen
Contributor Author

ndkeen commented Oct 8, 2024

I think I found at least one "bad node". If I use the compute node nid004324, I can get the errors noted above.

At least for the following test cases:

ERS_Ld3.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-nlmaps.sus/run/e3sm.log.31574931.241008-035021: 648:   ERROR: sum of areas on globe does not equal 4*pi
ERS_Vmoab.ne4pg2_oQU480.WCYCL1850NS.pm-cpu_intel.sus/run/e3sm.log.31574955.241008-065103: 96:   ERROR: sum of areas on globe does not equal 4*pi
SMS_Ld2.ne30pg2_r05_IcoswISC30E3r5.BGCEXP_CNTL_CNPRDCTC_1850.pm-cpu_intel.elm-bgcexp.sus/run/e3sm.log.31574941.241008-035104: 648:   ERROR: sum of areas on globe does not equal 4*pi

jobids: 31574931, 31574955, 31574941

Another failure from before that I did not report:

f2010.piCtl.ne120pg2_r025_IcoswISC30E3r5.nofini.r0270.pb/run/e3sm.log.31247419.241001-202908: 9728:   ERROR: sum of areas on globe does not equal 4*pi

I know I looked for suspect nodes with this and/or the related issue, but did not find correlations. There may be more than one.

@ndkeen
Contributor Author

ndkeen commented Oct 8, 2024

I pinged Mark Z for the job IDs of his failing tests: 29674471, 27517677.
The first does include nid004324, but the second does not.
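The kind of correlation described here can be sketched in a few lines of Python. The job names and node lists below are made up for illustration. Real node lists would come from the scheduler's accounting records for each failing job ID.

```python
from collections import Counter

# Hypothetical node lists for three failing jobs (illustrative values only).
failing_jobs = {
    "job_a": ["nid004101", "nid004324", "nid005010"],
    "job_b": ["nid004324", "nid006200"],
    "job_c": ["nid005555", "nid006855"],
}

def rank_nodes(jobs):
    """Count how many failing jobs each node took part in, most frequent first."""
    counts = Counter(node for nodes in jobs.values() for node in set(nodes))
    return counts.most_common()

for node, n_jobs in rank_nodes(failing_jobs):
    if n_jobs > 1:
        print(f"{node} appears in {n_jobs} failing jobs")
```

A node that recurs across failures is a candidate, but as with the second job above, one suspect node need not account for every failure.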

@ndkeen ndkeen added pm-cpu Perlmutter at NERSC (CPU-only nodes) intel Intel compilers labels Oct 8, 2024
@ndkeen
Contributor Author

ndkeen commented Oct 16, 2024

More updates on the other issue. To avoid these 2 suspect nodes, you can use case.submit -a="-x nid004324,nid006855"
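If the list of suspect nodes grows, the argument can be generated rather than typed by hand. A minimal Python sketch, where `exclude_batch_arg` is a hypothetical helper and not part of CIME:

```python
def exclude_batch_arg(nodes):
    """Format the -a value that passes an exclude list (-x) to the batch system."""
    return '-a="-x {}"'.format(",".join(nodes))

suspect_nodes = ["nid004324", "nid006855"]
print("case.submit " + exclude_batch_arg(suspect_nodes))
```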

@ndkeen ndkeen changed the title Intermittent runtime error in init: ERROR: sum of areas on globe does not equal 4*pi with pm-cpu_intel Intermittent runtime error in init: ERROR: sum of areas on globe does not equal 4*pi on pm-cpu. Solved: at least 2 suspect nodes on machine Oct 16, 2024
@ndkeen
Contributor Author

ndkeen commented Dec 12, 2024

Just an update on this issue: I've been working with NERSC to debug it in general. We have nid004324 roped off, but the other is still in the wild. I have a test that (slowly) goes through each compute node on the system and tries a super simple ne4 test. So far, I've tested 2383 unique nodes that are "ok".
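The structure of such a sweep might look like the Python sketch below. This is not the actual test script: `run_test` is a hypothetical callable that would submit the ne4 case pinned to one node and report whether it passed, and `fake_ne4_test` is a stub so the example runs on its own.

```python
def sweep_nodes(nodes, run_test, repeats=1):
    """Run run_test on each node `repeats` times and split nodes into ok/suspect."""
    ok, suspect = [], []
    for node in nodes:
        passed = all(run_test(node) for _ in range(repeats))
        (ok if passed else suspect).append(node)
    return ok, suspect

def fake_ne4_test(node):
    """Stub standing in for a real single-node ne4 run (always fails on one node)."""
    return node != "nid004324"

ok, suspect = sweep_nodes(["nid004323", "nid004324", "nid004325"], fake_ne4_test)
print("ok:", ok)
print("suspect:", suspect)
```

Because the failure is intermittent, a single pass does not prove a node is healthy. The `repeats` parameter is there to allow several runs per node.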
