Intermittent runtime error in init: ERROR: sum of areas on globe does not equal 4*pi
on pm-cpu. Solved: at least 2 suspect nodes on machine
#6533
Comments
Noting I see this error again with the Aug 28th test on next. With the Aug 8th next, the following test hit the same error: With the Aug 15th next, this test failed with the same error:
I too just hit this today. At the bottom of the atm.log file I see:
I was able to successfully re-run the exact same run script later.
ERP_Ln9.ne4pg2_oQU480.WCYCL20TRNS-MMF1.pm-cpu_intel.allactive-mmf_fixed_subcycle
I had a thought that we might address this issue by allowing the area and weights calculation to be attempted multiple times, so I rewrote the section for this in
I ran a few tests where I disabled the first check of the
Still, I think this sort of thing might be a good way to avoid this error if it is just a random hiccup that is messing up the MPI gather part of the process. I'm just not sure how to test it other than running the same test over and over and seeing if I can catch the values actually changing. I'm happy to turn this into a quick PR if we think it's a good idea.
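To make the retry idea concrete, here is a minimal Python sketch (not the actual model code). The function name `global_area_with_retry`, the `compute_area_sum` callable, the attempt count, and the tolerance are all hypothetical and chosen only for illustration. It keeps every attempted value so that a change between attempts can be logged as evidence of a transient fault.

```python
import math


def global_area_with_retry(compute_area_sum, max_attempts=3, rel_tol=1e-10):
    """Recompute the global area sum until it matches 4*pi or attempts run out.

    compute_area_sum is a hypothetical stand-in for the routine that performs
    the reduction; max_attempts and rel_tol are assumed values, not the
    thresholds used by the model.
    """
    expected = 4.0 * math.pi
    history = []  # every attempted value, kept for diagnostics
    for attempt in range(1, max_attempts + 1):
        total = compute_area_sum()
        history.append(total)
        if math.isclose(total, expected, rel_tol=rel_tol):
            return total, attempt, history
    raise RuntimeError(
        "sum of areas on globe does not equal 4*pi after "
        f"{max_attempts} attempts: {history}"
    )


# Usage: a fake reduction that returns a corrupted value once, then recovers.
_values = iter([12.0, 4.0 * math.pi])
total, attempts, history = global_area_with_retry(lambda: next(_values))
print(attempts, history)
```

If the value does differ between attempts, that is worth reporting loudly even when the retry succeeds, since it suggests the first result was corrupted rather than merely imprecise.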
If this check fails, doesn't it mean there is some real data corruption somewhere, usually in the MPI systems? I would worry that means it's not safe to continue the run, even if you can get it to pass this check. The fact that it is failing here might just be because this is one of the first checks with an abort.
ERROR: sum of areas on globe does not equal 4*pi
with pm-cpu_intel
These 2 tests failed with this error using 9/15 checkout:
Similar error to #6469
After the Intel compiler update, we had not seen any more failures of this sort until last night. With next of Oct 7th, I see 2:
I posted more on the similar (and likely same root cause) issue Rob noted above, but so far, after 500+ cases on muller-cpu, I've not seen any failures like these. This is with newer Slingshot software that will be coming to Perlmutter soon. With testing so far, it does not seem to impact our simulations.
I think I found at least one "bad node". If I use the compute node At least for the following test cases:
Another failure from before that I did not report:
I know I looked for suspect nodes with this and/or the related issue, but did not find correlations. There may be more than one.
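One way to look for such correlations is to tally how often each node appears in failing versus passing jobs. The sketch below is only an illustration: the function name `rank_suspect_nodes`, the input format, and the node names are hypothetical, and the per-job node lists would have to come from the scheduler's accounting records.

```python
from collections import Counter


def rank_suspect_nodes(jobs):
    """Rank nodes by the fraction of their jobs that failed.

    jobs is an iterable of (passed, nodes) pairs, where passed is a bool and
    nodes is the list of node names the job ran on (hypothetical format).
    Returns (node, failures, total_jobs) tuples for nodes with >= 1 failure.
    """
    fails = Counter()
    total = Counter()
    for passed, nodes in jobs:
        for node in set(nodes):  # count each node once per job
            total[node] += 1
            if not passed:
                fails[node] += 1
    ranked = [(n, fails[n], total[n]) for n in total if fails[n] > 0]
    # Highest failure fraction first, then most failures, then name.
    ranked.sort(key=lambda r: (-r[1] / r[2], -r[1], r[0]))
    return ranked


# Usage with made-up node names: node-a appears in every failing job.
jobs = [
    (False, ["node-a", "node-b"]),
    (True, ["node-b", "node-c"]),
    (False, ["node-a", "node-d"]),
    (True, ["node-c", "node-d"]),
]
for node, n_fail, n_total in rank_suspect_nodes(jobs):
    print(node, n_fail, n_total)
```

With only a handful of failures the ranking is noisy, and with more than one bad node no single node will appear in every failing job, so a high failure fraction is a hint rather than proof.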
I pinged Mark Z to see the job IDs of his failing tests:
More updates on the other issue. To avoid these 2 suspect nodes, can
Just an update on this issue -- I've been working with NERSC to debug this issue in general. We have
One of the standard nightly tests failed last night
ERP_Ln9.ne4pg2_oQU480.WCYCL20TRNS-MMF1.pm-cpu_intel.allactive-mmf_fixed_subcycle
I think this is another intermittent issue similar to #6469
It might happen on the order of 1 in 20 or 1 in 100?
The error happens in the first run of the ERP -- so I would assume we could reproduce with SMS.
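For a rough sense of how many repeat runs a reproduction attempt would take, here is a small Python sketch. It assumes each run fails independently with the same probability, which is a simplification (it would not hold if the failures are tied to particular nodes). The function name `runs_needed` and the 95% confidence target are illustrative choices.

```python
import math


def runs_needed(p_fail, confidence=0.95):
    """Smallest n such that P(at least one failure in n runs) >= confidence.

    Assumes independent runs with a constant per-run failure probability,
    so P(no failure in n runs) = (1 - p_fail) ** n.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


for p in (1 / 20, 1 / 100):
    print(f"p = {p:.2f}: {runs_needed(p)} runs")
```

Under these assumptions, a 1-in-20 rate needs about 59 runs and a 1-in-100 rate about 299 runs to have a 95% chance of seeing at least one failure.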