eamxx: PIO: FATAL ERROR: File create mode is inconsistent among processes when partial restart files exist #6843

Open
ndkeen opened this issue Dec 10, 2024 · 2 comments
Labels
EAMxx: PRs focused on capabilities for EAMxx
pm-gpu: Perlmutter machine at NERSC (GPU nodes)
SCORPIO: The E3SM I/O library (derived from PIO)

Comments

ndkeen commented Dec 10, 2024

I was trying to run some ne1024 benchmark cases on pm-gpu with 512 nodes. Each case runs 3 days with output and writes a restart on the last day. The case completed OK once, but the timing was much slower than expected, so I simply resubmitted. The timing was slow again (an ongoing issue, but not the point here), so I resubmitted once more. However, the next 4 jobs all failed with the following error:

 908: PIO: FATAL ERROR: Aborting... FATAL ERROR: File create mode is inconsistent among processes. (file = cess-v2-cntl.ne1024pg2_ne1024pg2.F2010-SCREAMv1.se18-oct18.n0512t4x111XX1.allyamlhist.3d.wr.sk.dins.cice.r.2019-08-04-00000.nc) (/dvs_ro/cfs/cdirs/e3sm/ndk/repos/se18-oct18/externals/scorpio/src/clib/pioc_support.c: 3418)

/pscratch/sd/n/ndk/e3sm_scratch/pm-gpu/se18-oct18/cess-v2-cntl.ne1024pg2_ne1024pg2.F2010-SCREAMv1.se18-oct18.n0512t4x111XX1.allyamlhist.3d.wr.sk.dins

@mahf708 said he also hit the same error (I think also with ne1024 cases on pm-gpu) and found that he needed to remove any partial restart files, perhaps cice.r in particular.

Now in my case, I'm not trying to read from restarts, only write them. But for the failed jobs, some partial restart files already existed.
I've already removed the restarts (and rpointers) that were there, and I'm testing whether the next job will be OK.

Labelling as eamxx, but not sure if it's specific to eamxx.

mahf708 commented Dec 10, 2024

Yeah, I needed to remove cice-related files to get around all of this. IIRC, it was the r, rh, and h files that needed to be removed, but I didn't take notes.

FWIW, I did hit this with ne256pg2 as well.
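
The manual cleanup described in this thread could be scripted roughly as follows. This is only a sketch under assumptions: the glob patterns are guesses based on the file name in the error message and on the r, rh, and h files recalled above, and the helper names `find_stale_files` and `quarantine` are made up for illustration. It moves files aside rather than deleting them, so anything caught by mistake can be restored.

```python
import shutil
from pathlib import Path

# Hypothetical patterns (assumptions, not confirmed by the thread):
# cice restart (r), restart-history (rh), and history (h) files, plus rpointers.
DEFAULT_PATTERNS = ("*.cice.r.*", "*.cice.rh*", "*.cice.h*", "rpointer.*")


def find_stale_files(run_dir, patterns=DEFAULT_PATTERNS):
    """Return the sorted files directly inside run_dir that match any pattern."""
    run_dir = Path(run_dir)
    matches = set()
    for pattern in patterns:
        matches.update(p for p in run_dir.glob(pattern) if p.is_file())
    return sorted(matches)


def quarantine(paths, dest_dir):
    """Move the given files into dest_dir and return their new locations."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in paths:
        target = dest_dir / path.name
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

A cautious way to use it is to print the result of `find_stale_files(run_dir)` first, check the list by eye, and only then call `quarantine` on it before resubmitting.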

ndkeen commented Dec 12, 2024

OK, I simply removed all of the partial restarts and resubmitted. The job ran fine and the restarts were written as expected. This suggests to me that the code is perhaps looking at existing files and doing some logic before trying to open and write, which is maybe not what we want.
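
To make that hypothesis concrete, here is a toy model in plain Python. It is not SCORPIO code and not a diagnosis; every name in it is hypothetical, and the "clobber"/"noclobber" strings are just placeholders for two different create modes. It only illustrates how a per-process decision based on whether a leftover file is visible could disagree across processes, and how deciding on one rank and sharing the result would avoid that.

```python
def choose_create_mode(file_visible):
    """Hypothetical per-process choice: overwrite if a leftover file is seen."""
    return "clobber" if file_visible else "noclobber"


def modes_consistent(modes):
    """True when every process ended up with the same create mode."""
    return len(set(modes)) == 1


def decide_on_root_and_share(visibility_per_rank, root=0):
    """Let one rank decide; all ranks adopt its choice (stands in for a broadcast)."""
    mode = choose_create_mode(visibility_per_rank[root])
    return [mode] * len(visibility_per_rank)


# Assumed scenario: rank 2 does not see the partial file that the others see.
views = [True, True, False, True]
per_rank = [choose_create_mode(v) for v in views]
print(modes_consistent(per_rank))                         # independent decisions
print(modes_consistent(decide_on_root_and_share(views)))  # single shared decision
```

If something like this were happening, removing the leftover files would make every process see the same state, which would be consistent with the workaround succeeding. Whether the library actually behaves this way would need to be checked against the code around the line reported in the error message.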
