GCHP 14.3.1 Disable writing final checkpoint file? #461

Open
cbutenhoff opened this issue Nov 12, 2024 · 6 comments
Labels
category: Question (Further information is requested), topic: Restart Files (Related to GCHP restart files)

Comments

@cbutenhoff

Your name: Chris Butenhoff

Your affiliation: Portland State University

Please provide a clear and concise description of your question or discussion topic.

Although I realize this probably isn't best practice, I am running a number of GCHP jobs from the same run directory so they all write their checkpoint files to the same Restart directory.

If a gcchem_internal_checkpoint file already exists, a job hangs (it is not removed from the SLURM job queue) when trying to write its final checkpoint, with this error:

...

Character Resource Parameter: GCHPchem_INTERNAL_CHECKPOINT_FILE:Restarts/gcchem_internal_checkpoint
Character Resource Parameter: GCHPchem_INTERNAL_CHECKPOINT_TYPE:pnc4
Using parallel NetCDF for file: Restarts/gcchem_internal_checkpoint
CB NetCDF4_FileFormatter.F90 status=         -35
CB NetCDF4_FileFormatter.F90 IOR=        4100
CB NetCDF4_FileFormatter.F90 mode=           4
CB NetCDF4_FileFormatter.F90 NF90_=        4096
CB NetCDF4_FileFormatter.F90 file=Restarts/gcchem_internal_checkpoint
CB NetCDF4_FileFormatter.F90 err=NetCDF: File exists && NC_NOCLOBBER
pe=00000 FAIL at line=00181    NetCDF4_FileFormatter.F90                <status=-35>
pe=00000 FAIL at line=03828    NCIO.F90                                 <status=-35>
pe=00000 FAIL at line=04081    NCIO.F90                                 <status=-35>
pe=00000 FAIL at line=05807    MAPL_Generic.F90                         <status=-35>
pe=00000 FAIL at line=02124    MAPL_Generic.F90                         <status=-35>
pe=00000 FAIL at line=03535    Chem_GridCompMod.F90                     <status=-35>
pe=00000 FAIL at line=01807    MAPL_Generic.F90                         <status=-35>
pe=00000 FAIL at line=02053    MAPL_Generic.F90                         <status=-35>
pe=00000 FAIL at line=00779    GCHP_GridCompMod.F90                     <status=-35>
pe=00000 FAIL at line=01807    MAPL_Generic.F90                         <status=-35>
pe=00000 FAIL at line=00873    MAPL_CapGridComp.F90                     <status=-35>

because the NetCDF file create is apparently set to no-clobber (NC_NOCLOBBER) in NetCDF4_FileFormatter.F90 (the "CB" lines above are my print statements).
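For illustration only (this is not GCHP or MAPL code), the same exclusive-create semantics can be sketched in plain Python, where open mode "x" refuses to overwrite an existing file just as NF90_NOCLOBBER does:

```python
import os
import tempfile

# Exclusive-create ("no clobber") semantics: mode "x" fails if the file
# already exists instead of overwriting it, mirroring the NF90_NOCLOBBER
# behavior reported in the traceback above.
def create_checkpoint(path):
    with open(path, "x") as f:
        f.write("checkpoint data\n")

workdir = tempfile.mkdtemp()
checkpoint = os.path.join(workdir, "gcchem_internal_checkpoint")

create_checkpoint(checkpoint)        # first write succeeds
try:
    create_checkpoint(checkpoint)    # second write is refused
    clobbered = True
except FileExistsError:
    clobbered = False
```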

I can imagine workarounds, including changing the source code to allow clobbering, but is there a way to simply disable writing this final checkpoint file?
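One pragmatic workaround I can imagine (my own sketch, not an official recommendation, and it knowingly discards the evidence that a previous run failed): remove any stale checkpoint at the top of each job script before GCHP starts. Paths here are illustrative:

```shell
# clean_stale_checkpoint: remove a leftover checkpoint before a new run,
# so the end-of-run no-clobber write cannot collide with it.
clean_stale_checkpoint() {
    local ckpt="$1"
    if [ -f "$ckpt" ]; then
        echo "WARNING: removing stale checkpoint $ckpt" >&2
        rm -f "$ckpt"
    fi
}

# In the SLURM job script, before launching gchp:
clean_stale_checkpoint "Restarts/gcchem_internal_checkpoint"
```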

@cbutenhoff cbutenhoff added the category: Question Further information is requested label Nov 12, 2024
@lizziel
Contributor

lizziel commented Nov 13, 2024

Hi @cbutenhoff, I think the only way to turn this off is to edit MAPL source code and recompile. However, why would you want to do this? If gcchem_internal_checkpoint exists, then either GCHP is actively using it or a run did not complete properly, in which case the run directory will not rename it to the expected input restart format for later use. If you have multiple GCHP runs writing to the same Restarts folder, they might write to the same restart file. And if gcchem_internal_checkpoint is only still around because a previous run failed, and you allow clobbering it, how would you know that the previous run failed? As you say, using a single run directory for multiple simultaneous runs is bad practice; there are several ways things can go wrong, and this checkpoint collision is just one of them.
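The safer alternative to sharing one run directory can be sketched as follows (illustrative paths and job names; adapt to your own setup): give each job its own copy of the run directory, so every run writes to its own Restarts/ folder and checkpoints never collide.

```shell
# make_job_rundir: clone a template run directory for one job so each
# job gets a private Restarts/ folder. Destination is <template>_<job>.
make_job_rundir() {
    local template="$1" job="$2"
    cp -r "$template" "${template}_${job}"
}

# Example usage (hypothetical paths):
#   make_job_rundir ~/gchp_rundir job01
#   sbatch --chdir="$HOME/gchp_rundir_job01" gchp.run
```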

@lizziel lizziel self-assigned this Nov 13, 2024
@cbutenhoff
Author

Thanks @lizziel.

If I have Midrun_Checkpoint=OFF, then GCHP should only write gcchem_internal_checkpoint at the end of the run, correct?

It seems to be taking a long time for this file to write. By monitoring the file size with 'ls -l', I found it took at least 30 minutes to finish writing, which prevents the job from exiting the cluster queue.

For a C24 run, do you know about how long it should take to write this file?

@lizziel
Contributor

lizziel commented Nov 14, 2024

The mid-run checkpoints write to different filenames since each one includes the date. gcchem_internal_checkpoint is reserved for end-of-run restart.

Taking a long time to write the restart could be due to your MPI; which one are you using? We recommend OpenMPI, since Intel MPI can sometimes cause problems like this. Also, how many cores are you using? You could also try turning on the restart-write O-server in GCHP.rc; look for the entry WRITE_RESTART_BY_OSERVER.
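For reference, the O-server toggle in GCHP.rc looks something like the fragment below (the exact default and surrounding entries may differ by GCHP version; check your own GCHP.rc):

```
# Write the end-of-run restart via the MAPL output server
WRITE_RESTART_BY_OSERVER: YES
```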

It might also be helpful to look through the existing GCHP GitHub issues. Use the search bar to search for "restart".

@lizziel
Contributor

lizziel commented Nov 14, 2024

To answer your question about writing the restart, it should take a few seconds or less for C24.

@cbutenhoff
Author

Thanks @lizziel.

Something is definitely going on then, because it's taking about 40 minutes to write gcchem_internal_checkpoint. I also noticed that GCHP did not rename the file after writing, and it's over 600 MB (the C24 restart file from the GCHP distribution is 391 MB).

I'm using OpenMPI 4.1.4 and 288 cores. I'll try turning on the O-server. This all may just be a consequence of running multiple jobs in the same run directory. I'll see if this repeats if I run a single job.

@lizziel
Contributor

lizziel commented Nov 18, 2024

If gcchem_internal_checkpoint was not renamed at the end then your job either timed out or failed. Do you have log files to share?

@yantosca yantosca added the topic: Restart Files Related to GCHP restart files label Dec 2, 2024