I was trying to run some ne1024 benchmark cases on pm-gpu with 512 nodes. Each job runs 3 days with output and writes a restart on the last day. The case completed OK once, but the timing was much slower than expected, so I simply resubmitted. Again the timing was slow (it's an ongoing issue, but not the point here), so I resubmitted again. However, the next 4 jobs all failed with the following error:
@mahf708 said he also hit the same error (I think also with ne1024 cases on pm-gpu) and found that he needed to remove any partial restart files, perhaps `cice.r` in particular.
Now in my case, I'm not trying to read from restarts, only writing them. But for the failed jobs, some partial restart files were present.
I've already removed the restarts (and rpointers) that were there, and I'm testing whether the next job will be OK.
Labelling as eamxx, but I'm not sure it's specific to eamxx.
ndkeen added labels on Dec 10, 2024: SCORPIO (the E3SM I/O library, derived from PIO), EAMxx (PRs focused on capabilities for EAMxx), pm-gpu (Perlmutter machine at NERSC, GPU nodes)
ndkeen changed the title to "eamxx: PIO: FATAL ERROR: File create mode is inconsistent among processes when partial restart files exist" on Dec 10, 2024
OK, I simply removed all of the partial restarts and resubmitted. The job was OK, with restarts written as expected. This suggests to me that the code is perhaps looking at existing files and applying some logic before it tries to open and write, which is maybe not what we want.
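For anyone hitting this before a proper fix lands, the manual cleanup described above could be scripted roughly as follows. This is only a sketch of the workaround, not anything from E3SM or SCORPIO: the function name `quarantine_restarts`, the glob patterns, and the `stale_restarts` folder name are all hypothetical, and the patterns are a guess at how restart and rpointer files are named, so check them against your own run directory. It moves the files aside rather than deleting them, and defaults to a dry run.

```python
import shutil
from pathlib import Path


def quarantine_restarts(run_dir,
                        patterns=("*.r.*.nc", "rpointer.*"),  # assumed naming, adjust as needed
                        dest_name="stale_restarts",           # hypothetical folder name
                        dry_run=True):
    """Move files matching `patterns` out of the top level of `run_dir`.

    Returns the sorted names of the matching files. With dry_run=True
    (the default) nothing is moved; the matches are only reported.
    """
    run_dir = Path(run_dir)
    dest = run_dir / dest_name

    # Non-recursive glob: only files directly inside run_dir are considered,
    # so anything already moved into the quarantine folder is ignored.
    matches = sorted({p for pat in patterns
                      for p in run_dir.glob(pat) if p.is_file()})

    if not dry_run and matches:
        dest.mkdir(exist_ok=True)
        for p in matches:
            shutil.move(str(p), str(dest / p.name))

    return [p.name for p in matches]
```

Calling `quarantine_restarts("/path/to/run")` first lists what would be moved; rerunning with `dry_run=False` then moves those files into the subfolder before the job is resubmitted.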