CICE in ESM1.5 can repeatedly use the same restart file #538

blimlim · 2024-11-28T00:32:33Z

In the ESM1.5 configurations, we changed the CICE dumpfreq parameter from m (monthly) to y (yearly), reducing the number of restart files produced by CICE.

We'd expect that running ESM1.5 in monthly segments would fail, as there would be no valid CICE restart files for subsequent runs.
This is not the case – instead CICE repeatedly uses the same restart files. The following shows the restart directories produced by payu run -n 3 with a 1 month runtime:

ls restart00*/ice
restart000/ice:
cice_in.nml  iced.01010101  ice.restart_file  ice.restart_file-01001231  input_ice.nml  mice.nc  mice.nc-01001231  restart_date.nml

restart001/ice:
cice_in.nml  iced.01010101  ice.restart_file  ice.restart_file-01001231  input_ice.nml  mice.nc  mice.nc-01001231  restart_date.nml

restart002/ice:
cice_in.nml  iced.01010101  ice.restart_file  ice.restart_file-01001231  input_ice.nml  mice.nc  mice.nc-01001231  restart_date.nml

In each of the above the restart pointer file ice.restart_file points to the ./iced.01010101 file.

The restart_date.nml files do increment as they are calculated by payu:

cat restart00*/ice/restart_date.nml
&coupling
    inidate = 1010201
    init_date = 10101
/
&coupling
    inidate = 1010301
    init_date = 10101
/
&coupling
    inidate = 1010401
    init_date = 10101
/

In the second and third runs, no history is written (perhaps due to a time mismatch between the restart file and the calendar file).

ls output*/ice
output000/ice:
cice_in.nml    HISTORY     ice_diag_out  iceout086  iceout088  iceout090  iceout092  iceout094  iceout096
debug.root.03  ice_diag.d  iceout085     iceout087  iceout089  iceout091  iceout093  iceout095  input_ice.nml

output001/ice:
cice_in.nml    ice_diag.d    iceout085  iceout087  iceout089  iceout091  iceout093  iceout095  input_ice.nml
debug.root.03  ice_diag_out  iceout086  iceout088  iceout090  iceout092  iceout094  iceout096

output002/ice:
cice_in.nml    ice_diag.d    iceout085  iceout087  iceout089  iceout091  iceout093  iceout095  input_ice.nml
debug.root.03  ice_diag_out  iceout086  iceout088  iceout090  iceout092  iceout094  iceout096

It looks like during archive, the cice driver reads the pointer file ice.restart_file to determine the latest iced.YYYYMMDD restart file, and deletes all others. Since CICE didn't update the pointer, I think the original iced.01010101 is kept and repeatedly used.

payu/payu/models/cice.py

Lines 315 to 327 in 27aac37

    
    if not self.split_paths:
        res_ptr_path = os.path.join(self.restart_path, 'ice.restart_file')
        with open(res_ptr_path) as f:
            res_name = os.path.basename(f.read()).strip()
        assert os.path.exists(os.path.join(self.restart_path, res_name))
        # Delete the old restart file (keep the one in ice.restart_file)
        for f in self.get_prior_restart_files():
            if f.startswith('iced.'):
                if f == res_name:
                    continue
                os.remove(os.path.join(self.restart_path, f))


anton-seaice · 2024-11-29T03:16:53Z

There are a couple of things we could do:

1. Have payu check that dumpfreq is a whole fraction of the payu run length
2. Stop using the restart pointer file and get payu to set the restart file in ice_in (based on the run start date)
3. Have CICE check the date in the restart file (not sure if this is included in the binary restart format?)

Other?

blimlim · 2024-11-29T03:42:18Z

Thanks @anton-seaice!

It looks like payu is already making a new restart pointer in the setup stage, based on the latest iced file it finds in the restart directory:

payu/payu/models/cice.py

Lines 166 to 187 in 27aac37

    
    if self.prior_restart_path:
        # Generate ice.restart_file
        # TODO: better check of restart filename
        iced_restart_file = None
        iced_restart_files = [f for f in self.get_prior_restart_files()
                              if f.startswith('iced.')]
        if len(iced_restart_files) > 0:
            iced_restart_file = sorted(iced_restart_files)[-1]
        if iced_restart_file is None:
            raise FileNotFoundError(
                f'No iced restart file found in {self.prior_restart_path}')
        res_ptr_path = os.path.join(self.work_init_path,
                                    'ice.restart_file')
        if os.path.islink(res_ptr_path):
            # If we've linked in a previous pointer it should be deleted
            os.remove(res_ptr_path)
        with open(res_ptr_path, 'w') as res_ptr:
            res_dir = self.get_ptr_restart_dir()
            print(os.path.join(res_dir, iced_restart_file), file=res_ptr)

Do you think it would also work if we instead used the run start date when writing the restart pointer at the start of the run? I guess this is similar to your second suggestion.

anton-seaice · 2024-11-29T03:46:35Z

I guess when payu finds the latest restart file, check that the filename matches the start date?

Or possibly better ... payu determines the correct filename based on the start date, and checks that the restart file exists before creating an ice restart pointer file (or checks that the ice restart pointer has the same date in it?)
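A minimal sketch of that idea, for illustration only. The function name `find_iced_restart` and its arguments are hypothetical and not part of payu; the only thing taken from this thread is the `iced.YYYYMMDD` naming of the restart files.

```python
import datetime
import os


def find_iced_restart(restart_dir, start_date):
    """Return the iced.YYYYMMDD file name matching start_date, or raise.

    Hypothetical helper: builds the expected name from the run start date
    instead of picking the latest iced.* file in the directory.
    """
    name = (f"iced.{start_date.year:04d}"
            f"{start_date.month:02d}{start_date.day:02d}")
    if not os.path.exists(os.path.join(restart_dir, name)):
        # List what is actually there to make the error easier to act on.
        found = sorted(f for f in os.listdir(restart_dir)
                       if f.startswith("iced."))
        raise FileNotFoundError(
            f"Expected {name} in {restart_dir} for run start date "
            f"{start_date.isoformat()}, found {found}"
        )
    return name
```

With the directories from the original report, a run starting on 0101-02-01 would look for `iced.01010201`, find only `iced.01010101`, and stop instead of silently reusing the stale file.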

anton-seaice · 2024-11-29T04:01:54Z

Sorry - we have to use the restart pointer file (my option 2. above is not easily possible)

https://github.com/ACCESS-NRI/cice4/blob/e7549ebd2044690a432cc67c1317c81cb194b750/source/ice_restart.F90#L308-L309

blimlim · 2024-11-29T04:23:09Z

Looks like there are a couple of good options:

1. During setup, payu checks that there is an iced.YYYYMMDD restart file matching the run start date. If one matches, it writes this to the pointer file, and otherwise raises an error.
2. The binary restart files contain time information in their header. We could read this in, and check that the time contained matches the run start date (instead of relying on just the file name). This would additionally guard against naively changing the date in the filename. It would require us to do some calendar calculations, since the time in the header is given in seconds:

> import struct
> cicefile = open(cicepath, 'rb')
> header = cicefile.read(24)
> cicefile.close()
> bint, istep0, time, time_forc = struct.unpack('>iidd', header)
> print(time)
3155673600.0
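As a rough sketch of the calendar calculation, the header time can be turned into a date and compared with the run start date. This assumes a proleptic Gregorian calendar and that the time is counted in seconds from init_date; the function name `restart_header_date` is made up for illustration and is not payu's implementation.

```python
import datetime
import struct


def restart_header_date(path, init_date):
    """Read the model time from a CICE4 restart header and return it as a
    datetime, assuming the time is in seconds since init_date."""
    with open(path, "rb") as f:
        header = f.read(24)
    # Same layout as the snippet above: big-endian, two 4-byte integers
    # followed by two 8-byte floats.
    _bint, _istep0, time, _time_forc = struct.unpack(">iidd", header)
    # Python's datetime uses the proleptic Gregorian calendar, so this only
    # applies to Gregorian runs (an assumption, not checked here).
    return init_date + datetime.timedelta(seconds=time)
```

For the value printed above, 3155673600 s is 36524 days, which is 100 Gregorian years counted from 0001-01-01 (24 leap days), so with init_date = 10101 the header corresponds to 0101-01-01, matching the `iced.01010101` file name.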

@aidanheerdegen If you have any ideas or preferences, that would be really valuable!

anton-seaice · 2024-11-29T05:18:31Z

If you let me decide, I would say both. I think the question for Aidan is whether we should read the unformatted binary restart file and check its value, or just assume that the restart filename is correct / consistent.

Is there a case where folks intentionally need to restart from a different date than the model date? (and don't change it in the binary restart file)

blimlim · 2024-11-29T05:45:39Z

Good point!

> Is there a case where folks intentionally need to restart from a different date than the model date? (and don't change it in the binary restart file)

I suspect probably not. When the two don't line up it looks like things can go wrong with the history output, e.g. with the earlier example where it didn't write any history.

aidanheerdegen · 2024-12-02T00:57:06Z

> Is there a case where folks intentionally need to restart from a different date than the model date? (and don't change it in the binary restart file)

Actually there is, I think. When researchers are doing ensemble runs they sometimes grab restarts using a small time offset, or a time offset of +/- 1 year, and so need to manipulate the date headers so they're correct.

This is peripherally touched on in this forum thread:

https://forum.access-hive.org.au/t/ensemble-runs-with-access-cm2/1107/3

I wrote a small Fortran program for this purpose when I was working in the CMS team:

https://gist.github.com/aidanheerdegen/203af6f6e0a87d1d82704eae9608f099

because the models got out of sync if the time wasn't correct.

I think it's a lot cleaner to just perturb the same restarts with some reproducible noise, so I don't think we have to support having incorrect dates in the restart files, and they can be changed in any case, especially if the expected values are printed to STDOUT.

anton-seaice · 2024-12-04T03:54:33Z

> Have payu check that dumpfreq from cice is a whole fraction of the payu run length

Is this bit feasible?

blimlim · 2024-12-04T04:35:27Z

I think it's doable! It's slightly complicated by the config.yaml technically supporting runtimes like:

        years: 0
        months: 2
        days: 31

and if dumpfreq=m, it might or might not be a fraction of the run length depending on the start date.
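To make the start-date dependence concrete, here is a small standard-library sketch. The helper `add_runtime` is hypothetical, and it assumes the months are added before the days; payu's own run-length calculation may differ.

```python
import calendar
import datetime


def add_runtime(start, months, days):
    """Add whole months, then days, to a date (assumed ordering)."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    # Clamp the day so e.g. 31 Jan + 1 month gives the last day of Feb.
    day = min(start.day, calendar.monthrange(year, month)[1])
    return datetime.date(year, month, day) + datetime.timedelta(days=days)


# Same runtime (2 months + 31 days), two different start dates.
for start in (datetime.date(101, 1, 1), datetime.date(101, 2, 1)):
    end = add_runtime(start, months=2, days=31)
    print(start, "->", end, "| ends on a month boundary:", end.day == 1)
```

Starting from 0101-01-01 the run ends on 0101-04-01, a month boundary, while starting from 0101-02-01 it ends on 0101-05-02, which is not.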

The best I could come up with is something like:

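from dateutil.relativedelta import relativedelta

# dumpfreq_unit is 'years' or 'months', derived from dumpfreq = y or m
dump_delta = relativedelta(**{dumpfreq_unit: dumpfreq_n})
dump_matches_end_date = False
dump_date = start_date

# Use <= so that a dump landing exactly on end_date is detected
while dump_date <= end_date:
    if dump_date == end_date:
        dump_matches_end_date = True
    dump_date = dump_date + dump_delta

if not dump_matches_end_date:
    raise ValueError('No CICE restart would be written at the run end date')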

Would this sort of approach look ok?

anton-seaice · 2024-12-04T04:49:03Z

I think so, looks good :)

blimlim · 2024-12-11T03:11:24Z

Unfortunately checking the dump dates prior to the run looks a bit more complicated than I originally thought, as the way CICE4 chooses when to write dump files is different to what I'd expected.

There are some ACCESS-specific calendar calculations here, and it then sets whether to write a restart at a given time step here

It looks like when dumpfreq = y, CICE will write a restart when it crosses into a new calendar year, rather than after a year of simulation. Likewise, if dumpfreq=m I think it will write a restart when it crosses into a new month, rather than after a month of simulation (which might be hard to define).

E.g. If we run in monthly segments for 6 months, with dumpfreq=m, dumpfreq_n=1, we get:

restart000/ice:
iced.01010201 ...

restart001/ice:
iced.01010301 ...

restart002/ice:
iced.01010401 ...

restart003/ice:
iced.01010501 ...

restart004/ice:
iced.01010601 ... 

restart005/ice:
iced.01010701  ...

If the run is then continued for 1 year, with dumpfreq=y, dumpfreq_n=1,
the final restart is:

restart006/ice:
iced.01020101

I.e. it wrote a restart when it crossed into the new year, instead of at the end of the year-long simulation.
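Under that reading, a pre-run check could be sketched as a test of whether the run end date falls on a calendar boundary. This is only an illustration of the behaviour observed above: it assumes dumpfreq_n = 1, handles only y and m, and ignores the ACCESS-specific calendar calculations. The function name `ends_on_dump_boundary` is hypothetical.

```python
import datetime


def ends_on_dump_boundary(end_date, dumpfreq):
    """Return True if a run ending at end_date would finish on a restart
    write, assuming CICE dumps on entering a new calendar year ('y') or
    month ('m') and that dumpfreq_n = 1."""
    if dumpfreq == "y":
        return end_date.month == 1 and end_date.day == 1
    if dumpfreq == "m":
        return end_date.day == 1
    raise ValueError(f"Unsupported dumpfreq in this sketch: {dumpfreq!r}")
```

For the case that started this issue, a one-month run ending on 0101-02-01 with dumpfreq=y gives `ends_on_dump_boundary(datetime.date(101, 2, 1), "y") == False`, so the missing restart would be flagged before the run. The year-long run above ending on 0102-07-01 is also flagged, since its only restart is written at 0102-01-01.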

Given the time constraints (with @aidanheerdegen finishing up for the year at the end of this week), what would you @anton-seaice and @aidanheerdegen think of deferring this additional check to next year, and including just the following checks in the current release:

- The iced.YYYYMMDD file in the restart pointer matches the experiment start date.
- The time in the binary restart header matches the time since init_date calculated by payu.

I have a working version of the above updates with unit tests in #539, which would be ready for review if we're happy to delay the dumpfreq checks. Let me know what you prefer!

anton-seaice · 2024-12-11T03:33:38Z

Yes, that sounds good. I'll review soon :)

blimlim · 2024-12-11T03:51:04Z

Awesome, thanks @anton-seaice! I'll make a separate issue for adding the dumpfreq checks

blimlim added the bug label Nov 28, 2024

blimlim mentioned this issue Dec 3, 2024

Check CICE4 restart file dates #539

Merged
