Intermittent runtime error in init: surfrd_veg_all ERROR: sum of wt_cft not 1.0 on pm-cpu. Solved: at least 2 nodes are suspect #6469
Comments
I see this again with
I also ran into this error twice while running a v3.HR F2010 case (out of 6 recent submissions)
Second case: job id 29997275.240831
NDK: For these 2 jobids, the first contains
And here are the ne512 land-only spin-up runs that encountered the same error. The failure occurs randomly (resubmitting, sometimes more than once, can overcome it; in repeated failures, the reported error can occur at a different column with a different sum). All failures occur during initialization. The first number after PID is the process element id, as seen in e3sm.log. The runs were using pm-cpu. 28289399.240719-160327 NDK: It was most useful that Wuyin included job ids here.
Minor update: I'm also trying to update the Intel compiler (with other module version changes) in #6596, so I will try a few tests with that (but again, since the failure frequency is low, it may not be easy to tell if this has any impact at all).
With next of 9/15, the following tests hit this error:
surfrd_veg_all ERROR: sum of wt_cft not 1.0
surfrd_veg_all ERROR: sum of wt_cft not 1.0
surfrd_veg_all ERROR: sum of wt_cft not 1.0
with pm-cpu_intel
As we had several tests hit this error (normally 0, every now and then 1), I tried to see if I could repeat it with one of the 1-node tests above. I also tried several tests, and then about 15 more cases.
Just to document here that another submission of an F2010 case at ne120 ran into the error:
NDK: noting this job includes the 4324 bad node
Note that this appears to be more than one type of error. The back trace is the same -- it is calling the same check_sums routine for different fields.
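For reference, the check being tripped is conceptually a per-column weight-sum test; the minimal sketch below illustrates the idea (this is not the actual ELM surfrd_veg_all/check_sums source -- the argument names, bounds, and tolerance are assumptions).

```fortran
! Illustrative sketch only: a per-column weight-sum check of the kind the
! error message implies. Not the actual ELM code; names, bounds, and the
! tolerance value are assumptions.
subroutine check_sums_equal_1(wt_cft, begc, endc, ncft, caller)
   implicit none
   integer, parameter :: r8 = selected_real_kind(12)
   integer,  intent(in) :: begc, endc, ncft          ! column bounds and number of crop types
   real(r8), intent(in) :: wt_cft(begc:endc, ncft)   ! crop functional type weights per column
   character(len=*), intent(in) :: caller            ! routine name for the error message
   integer  :: c
   real(r8) :: sumwt
   real(r8), parameter :: eps = 1.e-12_r8            ! assumed tolerance

   do c = begc, endc
      sumwt = sum(wt_cft(c,:))
      if (abs(sumwt - 1._r8) > eps) then
         write(*,*) trim(caller)//' ERROR: sum of wt_cft not 1.0 at c = ', c, ' sum = ', sumwt
         error stop 1   ! the real code would call the model's endrun
      end if
   end do
end subroutine check_sums_equal_1
```

Since a check like this is simple and deterministic, an intermittent failure here points at the data reaching the routine rather than at the check itself.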
Thanks. Wuyin also indicated that he is using a version of the code that includes the update to the Intel compiler version. Since we updated this (~Sep 19th), I've not seen any more errors of this sort on cdash -- and I've been running quite a few benchmark jobs on pm-cpu (and muller-cpu, almost identical) with the updated compiler version -- no errors like this yet. I certainly didn't think the compiler version would "fix" it.
I ran into similar errors with the compset F20TR and resolution "ne30pg2_r05_IcoswISC30E3r5"
Any clue how to solve this issue?
I actually had two 270-node F cases that failed. One of each variety:
and
Case: Then, while testing a potential fix I found for a different issue in init that I've been struggling with, I have seen two passes with this same 270-node setup. Certainly not conclusive, but this is an easy/safe thing to try. OK case: The potential fix/hack is adding an MPI_Barrier before a certain MPI_AllReduce, described here: With testing on muller-cpu, I've actually been unable to reproduce these errors (of sums not equal to 1.0) -- the only issues I've had so far are stalls/hangs. I've run 300-400 cases at different resolutions/node counts.
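For clarity, the Barrier hack is just forcing all ranks to synchronize before entering the reduction; a minimal sketch of the idea is below (the actual Allreduce call site in the E3SM init code is not reproduced here, and the names are placeholders).

```fortran
! Sketch of the Barrier-before-Allreduce workaround described above.
! The specific Allreduce in the E3SM init path is not reproduced here;
! mpicom, nvals, local_vals and global_vals are placeholder names.
subroutine guarded_allreduce(local_vals, global_vals, nvals, mpicom)
   use mpi
   implicit none
   integer, parameter :: r8 = selected_real_kind(12)
   integer,  intent(in)  :: nvals, mpicom
   real(r8), intent(in)  :: local_vals(nvals)
   real(r8), intent(out) :: global_vals(nvals)
   integer :: ierr

   ! The workaround: synchronize all ranks before the reduction.
   call MPI_Barrier(mpicom, ierr)
   call MPI_Allreduce(local_vals, global_vals, nvals, MPI_REAL8, MPI_SUM, mpicom, ierr)
end subroutine guarded_allreduce
```

An Allreduce should not require a preceding Barrier for correctness, so this reads as a probe/workaround rather than a real fix, which is consistent with the later finding that specific nodes were suspect.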
Wuyin gave me a land-only launch script that he had been using recently on pm-cpu, where he was encountering the error noted above more frequently. I tried it on muller-cpu.
I ran it with 8, 16, 32, 64, 128, and 256 nodes and have yet to see a failure of the sort we see on pm-cpu. I'm not yet sure what this means -- it seems unlikely that the slingshot changes on muller-cpu could be impacting this. However, I do see hangs in init at 256 nodes. The Barrier hack noted above does not seem to fix those, but the libfabric setting does allow it to run ok every time (so far). This might be enough evidence to say the hanging-in-init issue is simply different from the sum-of-values-not-always-1 issue.
With my testing on muller-cpu (which again is using the newer Slingshot SW coming soon to pm-cpu), I'm finding that: Now this might not even be related to the current issue above. It's just something we should consider trying even now on pm-cpu. We can add this to There might be a small perf impact with kdreg2, or maybe nothing -- very similar timing. I can try to better describe what this is doing, but my understanding is that it's something newer that HPE is working on. Just adding more info on that env var here for completeness:
Danqing had the same error with an F-case. jobid: 30483554
As with the other issue, it looks like there is at least one "bad node" on pm-cpu. If I specifically ask for
jobids:
Note this compute node was not used in some of the other failing jobs above.
I think I have found the other bad node.
To submit a job that will avoid these 2: I'm working with NERSC now; they have removed 4324 from the pool, but are letting me test on it.
Testing on the 4324 node, I have learned a few things:
I ran e3sm_developer only on
For example, with test
Re: ERS_Ld5.T62_oQU240wLI.GMPAS-DIB-IAF-DISMF.pm-cpu_intel, does the MPAS seaice log show anything? There might also be MPAS error files that give details. I base this guess on the stack trace you posted.
Ah yep, I thought I had checked. Indeed, that case has an error in the seaice log:
All of these tests are in
Random idea: All gnu cases fail with
Still debugging this. I learned that even here I see a difference between the two nodes. Normal node:
bad node:
Drilling down a little more, adding writes in this function:
a job on the bad node is different than one on a normal node. With 96 MPI ranks, it's always rank 2 that is different. Nothing is obviously wrong in the code -- what I've been trying to do is create a simple stand-alone reproducer. I have been unable to do so, as the tests always seem fine. I have only seen the issue with the E3SM app.
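For context, the writes mentioned above amount to rank-tagged dumps of the suspect values so output from a normal node and the bad node can be diffed; a rough sketch is below (the subroutine, variable, and file names are made up for illustration).

```fortran
! Rough sketch of rank-tagged diagnostic writes, used to diff values across
! ranks and nodes. Names are illustrative only, not the actual E3SM code.
subroutine dump_wt_cft_sums(wt_cft, begc, endc, ncft, mpicom)
   use mpi
   implicit none
   integer, parameter :: r8 = selected_real_kind(12)
   integer,  intent(in) :: begc, endc, ncft, mpicom
   real(r8), intent(in) :: wt_cft(begc:endc, ncft)
   integer :: rank, ierr, c, unit
   character(len=32) :: fname

   call MPI_Comm_rank(mpicom, rank, ierr)
   write(fname,'(a,i5.5,a)') 'wt_cft_rank', rank, '.txt'
   open(newunit=unit, file=fname, status='replace', action='write')
   do c = begc, endc
      ! Full precision so small per-rank differences show up in a diff.
      write(unit,'(i8,1x,es23.15)') c, sum(wt_cft(c,:))
   end do
   close(unit)
end subroutine dump_wt_cft_sums
```

Diffing per-rank files like these between a run on a normal node and one on the suspect node is the kind of comparison that showed rank 2 differing in the 96-rank runs above.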
Did a little more debugging (and tried to make a stand-alone reproducer) before I admitted defeat. NERSC has moved the 4324 node to DEBUG state and will either run some more tests or ask if HPE can.
With this test SMS_Ld2.ne30pg2_r05_IcoswISC30E3r5.BGCEXP_CNTL_CNPRDCTC_1850.pm-cpu_intel.elm-bgcexp, I see the following error:
surfrd_veg_all ERROR: sum of wt_cft not 1.0
I'm pretty sure I've seen this same error (with the same or a similar test) before, which may suggest an intermittent issue.
I see Rob also ran into #6192