CICE in ESM1.5 can repeatedly use the same restart file #538

Open
blimlim opened this issue Nov 28, 2024 · 14 comments

@blimlim
Contributor

blimlim commented Nov 28, 2024

In the ESM1.5 configurations, we changed the CICE dumpfreq parameter from m (monthly) to y (yearly), reducing the number of restart files produced by CICE.

We'd expect that running ESM1.5 in monthly segments would fail, as there would be no valid CICE restart files for subsequent runs.
This is not the case – instead CICE repeatedly uses the same restart files. The following shows the restart directories produced by payu run -n 3 with a 1 month runtime:

ls restart00*/ice
restart000/ice:
cice_in.nml  iced.01010101  ice.restart_file  ice.restart_file-01001231  input_ice.nml  mice.nc  mice.nc-01001231  restart_date.nml

restart001/ice:
cice_in.nml  iced.01010101  ice.restart_file  ice.restart_file-01001231  input_ice.nml  mice.nc  mice.nc-01001231  restart_date.nml

restart002/ice:
cice_in.nml  iced.01010101  ice.restart_file  ice.restart_file-01001231  input_ice.nml  mice.nc  mice.nc-01001231  restart_date.nml

In each of the above the restart pointer file ice.restart_file points to the ./iced.01010101 file.

The restart_date.nml files do increment as they are calculated by payu:

cat restart00*/ice/restart_date.nml
&coupling
    inidate = 1010201
    init_date = 10101
/
&coupling
    inidate = 1010301
    init_date = 10101
/
&coupling
    inidate = 1010401
    init_date = 10101
/

In the second and third run, no history is written (perhaps due to time mismatches between the restart file and the calendar file).

ls output*/ice
output000/ice:
cice_in.nml    HISTORY     ice_diag_out  iceout086  iceout088  iceout090  iceout092  iceout094  iceout096
debug.root.03  ice_diag.d  iceout085     iceout087  iceout089  iceout091  iceout093  iceout095  input_ice.nml

output001/ice:
cice_in.nml    ice_diag.d    iceout085  iceout087  iceout089  iceout091  iceout093  iceout095  input_ice.nml
debug.root.03  ice_diag_out  iceout086  iceout088  iceout090  iceout092  iceout094  iceout096

output002/ice:
cice_in.nml    ice_diag.d    iceout085  iceout087  iceout089  iceout091  iceout093  iceout095  input_ice.nml
debug.root.03  ice_diag_out  iceout086  iceout088  iceout090  iceout092  iceout094  iceout096

It looks like during archive, the cice driver reads the pointer file ice.restart_file to determine the latest iced.YYYYMMDD restart file, and deletes all others. Since CICE didn't update the pointer, I think the original iced.01010101 is kept and repeatedly used.

payu/payu/models/cice.py

Lines 315 to 327 in 27aac37

if not self.split_paths:
    res_ptr_path = os.path.join(self.restart_path, 'ice.restart_file')
    with open(res_ptr_path) as f:
        res_name = os.path.basename(f.read()).strip()
    assert os.path.exists(os.path.join(self.restart_path, res_name))
    # Delete the old restart file (keep the one in ice.restart_file)
    for f in self.get_prior_restart_files():
        if f.startswith('iced.'):
            if f == res_name:
                continue
            os.remove(os.path.join(self.restart_path, f))

@blimlim blimlim added the bug label Nov 28, 2024
@anton-seaice
Contributor

There are a couple of things we could do:

  1. Have payu check that dumpfreq is a whole fraction of the payu run length
  2. Stop using the restart pointer file and get payu to set the restart file in ice_in (based on the run start date)
  3. Have cice check the date in the restart file (not sure if this is included in the binary restart format?)

Other?

@blimlim
Contributor Author

blimlim commented Nov 29, 2024

Thanks @anton-seaice!

It looks like payu is already making a new restart pointer in the setup stage, based on the latest iced file it finds in the restart directory:

payu/payu/models/cice.py

Lines 166 to 187 in 27aac37

if self.prior_restart_path:
    # Generate ice.restart_file
    # TODO: better check of restart filename
    iced_restart_file = None
    iced_restart_files = [f for f in self.get_prior_restart_files()
                          if f.startswith('iced.')]
    if len(iced_restart_files) > 0:
        iced_restart_file = sorted(iced_restart_files)[-1]
    if iced_restart_file is None:
        raise FileNotFoundError(
            f'No iced restart file found in {self.prior_restart_path}')
    res_ptr_path = os.path.join(self.work_init_path,
                                'ice.restart_file')
    if os.path.islink(res_ptr_path):
        # If we've linked in a previous pointer it should be deleted
        os.remove(res_ptr_path)
    with open(res_ptr_path, 'w') as res_ptr:
        res_dir = self.get_ptr_restart_dir()
        print(os.path.join(res_dir, iced_restart_file), file=res_ptr)

Do you think it would also work if we instead used the run start date when writing the restart pointer at the start of the run? I guess this is similar to your second suggestion.

@anton-seaice
Contributor

I guess when payu finds the latest restart file, check that the filename matches the start date?

Or possibly better: payu determines the correct filename based on the start date, and checks that the restart file exists before creating an ice restart pointer file (or checks that the ice restart pointer has the same date in it?)
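
A minimal sketch of that filename check, assuming restart files follow the iced.YYYYMMDD pattern seen in the listings above, with a zero-padded four-digit year. The helper name `restart_name_for_start_date` and its arguments are hypothetical and not part of payu, and the standard library `date` stands in for whatever date type payu uses internally:

```python
import os
from datetime import date


def restart_name_for_start_date(restart_dir, start_date):
    """Return the iced.YYYYMMDD name for start_date, or raise if it is missing.

    Hypothetical helper, not payu's API. Assumes names like iced.01010101
    (year 0101, month 01, day 01).
    """
    name = (f"iced.{start_date.year:04d}"
            f"{start_date.month:02d}{start_date.day:02d}")
    if not os.path.isfile(os.path.join(restart_dir, name)):
        raise FileNotFoundError(
            f"No CICE restart matching start date {start_date.isoformat()}: "
            f"expected {name} in {restart_dir}"
        )
    return name
```

With the restart directories listed at the top of the issue, a start date of 0101-02-01 would raise here because only iced.01010101 exists, rather than silently reusing the old file.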

@anton-seaice
Contributor

Sorry - we have to use the restart pointer file (my option 2. above is not easily possible)

https://github.com/ACCESS-NRI/cice4/blob/e7549ebd2044690a432cc67c1317c81cb194b750/source/ice_restart.F90#L308-L309

@blimlim
Contributor Author

blimlim commented Nov 29, 2024

Looks like there are a couple of good options:

  1. During setup, payu checks that there is an iced.YYYYMMDD restart file matching the run start date. If one matches, it writes this to the pointer file, and otherwise raises an error.

  2. The binary restart files contain time information in their header. We could read this in, and check that the time contained matches the run start date (instead of relying on just the file name). This would additionally guard against naively changing the date in the filename. It would require us to do some calendar calculations, since the time in the header is given in seconds:

> import struct
> cicefile = open(cicepath, 'rb')
> header = cicefile.read(24)
> bint, istep0, time, time_forc = struct.unpack('>iidd', header)
> print(time)
3155673600.0
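
The calendar calculation for option 2 could look roughly like the sketch below. It assumes the 24-byte header layout from the snippet above, that the time is counted in seconds from an init date of 0001-01-01 (reading `init_date = 10101` that way is an assumption), and that Python's proleptic Gregorian calendar is an acceptable stand-in for the model calendar. The function names are hypothetical, and the header bytes are synthetic rather than read from a real iced file:

```python
import io
import struct
from datetime import date, timedelta


def read_header_time(fileobj):
    """Return the model time in seconds from the first 24 bytes of fileobj."""
    # Layout taken from the snippet above: two big-endian 4-byte ints
    # followed by two 8-byte doubles.
    _bint, _istep0, time, _time_forc = struct.unpack(">iidd", fileobj.read(24))
    return time


def header_date(seconds, init_date=date(1, 1, 1)):
    """Convert seconds since init_date to a calendar date.

    Assumption: proleptic Gregorian calendar, init date 0001-01-01.
    """
    return init_date + timedelta(seconds=seconds)


# Synthetic header standing in for a real restart file. Only the time value
# comes from the thread; the other fields are placeholders.
fake_restart = io.BytesIO(struct.pack(">iidd", 0, 0, 3155673600.0, 0.0))
restart_date = header_date(read_header_time(fake_restart))
```

Under these assumptions, 3155673600 seconds is 36524 days, which lands on 0101-01-01 and agrees with the iced.01010101 filename. payu could compare this date against the run start date and raise an error on a mismatch.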

@aidanheerdegen If you have any ideas or preferences, that would be really valuable!

@anton-seaice
Contributor

If you let me decide, I would say both. I think the question for Aidan is whether we should read the unformatted binary restart file and check its value, or just assume that the restart filename is correct / consistent.

Is there a case where folks intentionally need to restart from a different date than the model date ? (and don't change it in the binary restart file)

@blimlim
Contributor Author

blimlim commented Nov 29, 2024

Good point!

Is there a case where folks intentionally need to restart from a different date than the model date ? (and don't change it in the binary restart file)

I suspect probably not. When the two don't line up it looks like things can go wrong with the history output, e.g. with the earlier example where it didn't write any history.

@aidanheerdegen
Collaborator

Is there a case where folks intentionally need to restart from a different date than the model date ? (and don't change it in the binary restart file)

Actually there is I think. When researchers are doing ensemble runs they sometimes grab restarts using a small time offset, or a time-offset of +/- 1 year, and so need to manipulate the date headers so they're correct.

This is peripherally touched on in this forum thread:

https://forum.access-hive.org.au/t/ensemble-runs-with-access-cm2/1107/3

I wrote a small Fortran program for this purpose when I was working in the CMS team

https://gist.github.com/aidanheerdegen/203af6f6e0a87d1d82704eae9608f099

because the models got out of synch if the time wasn't correct.

I think it's a lot cleaner to just perturb the same restarts with some reproducible noise, so I don't think we have to support having incorrect dates in the restart files, and they can be changed in any case, especially if the expected values are printed to STDOUT.

@anton-seaice
Contributor

Have payu check that dumpfreq from cice is a whole fraction of the payu run length

Is this bit feasible ?

@blimlim
Contributor Author

blimlim commented Dec 4, 2024

I think it's doable! It's slightly complicated by the config.yaml technically supporting runtimes like:

        years: 0
        months: 2
        days: 31

and if dumpfreq=m, it might or might not be a fraction of the run length depending on the start date.

The best I could come up with is something like:

dump_matches_end_date = False
dump_delta = dumpfreq * dumpfreq_n  # as a relativedelta object
dump_date = start_date

while dump_date <= end_date:
    if dump_date == end_date:
        dump_matches_end_date = True
    dump_date = dump_date + dump_delta

if not dump_matches_end_date:
    error out

Would this sort of approach look ok?
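
A runnable sketch of the loop above, using only the standard library. It assumes dumpfreq is either `m` or `y`, uses Python's proleptic Gregorian `date` in place of the model calendar, and steps from the start date in whole months. The names `add_months` and `dump_matches_end_date` are hypothetical and not taken from payu:

```python
import calendar
from datetime import date

# Months per dump unit. Only 'm' and 'y' are handled in this sketch.
MONTHS_PER_UNIT = {"m": 1, "y": 12}


def add_months(d, months):
    """Shift d by whole months, clipping the day to the target month's length."""
    year, month0 = divmod(d.year * 12 + (d.month - 1) + months, 12)
    day = min(d.day, calendar.monthrange(year, month0 + 1)[1])
    return date(year, month0 + 1, day)


def dump_matches_end_date(start_date, end_date, dumpfreq, dumpfreq_n=1):
    """Return True if some multiple of the dump interval lands on end_date."""
    if dumpfreq not in MONTHS_PER_UNIT or dumpfreq_n < 1:
        raise ValueError(f"Unsupported dumpfreq={dumpfreq!r}, "
                         f"dumpfreq_n={dumpfreq_n}")
    step = MONTHS_PER_UNIT[dumpfreq] * dumpfreq_n
    n = 1
    # Always offset from start_date so day clipping does not accumulate.
    dump_date = add_months(start_date, step)
    while dump_date <= end_date:
        if dump_date == end_date:
            return True
        n += 1
        dump_date = add_months(start_date, n * step)
    return False
```

For the runtime of 2 months and 31 days shown above (assuming months are applied before days), a start date of 0101-01-01 gives an end date of 0101-04-01 and the check passes for dumpfreq=m, while a start date of 0101-02-01 gives 0101-05-02 and the check fails.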

@anton-seaice
Contributor

I think so, looks good :)

@blimlim
Contributor Author

blimlim commented Dec 11, 2024

Unfortunately checking the dump dates prior to the run looks a bit more complicated than I originally thought, as the way CICE4 chooses when to write dump files is different to what I'd expected.

There are some ACCESS specific calendar calculations here, and it then sets whether to write a restart at a given time step here

It looks like when dumpfreq=y, CICE will write a restart when it crosses into a new calendar year, rather than after a year of simulation. Likewise, if dumpfreq=m I think it will write a restart when it crosses into a new month, rather than after a month of simulation (which might be hard to define).

E.g. If we run in monthly segments for 6 months, with dumpfreq=m, dumpfreq_n=1, we get:

restart000/ice:
iced.01010201 ...

restart001/ice:
iced.01010301 ...

restart002/ice:
iced.01010401 ...

restart003/ice:
iced.01010501 ...

restart004/ice:
iced.01010601 ... 

restart005/ice:
iced.01010701  ...

If the run is then continued for 1 year, with dumpfreq=y, dumpfreq_n=1,
the final restart is:

restart006/ice:
iced.01020101 

I.e. it wrote a restart when it crossed into the new year, instead of at the end of the year-long simulation.
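
The observed behaviour can be mimicked with a small sketch that lists the calendar boundaries crossed during a run. It is based only on the restart names reported in this comment, not on the CICE source, ignores dumpfreq_n, and uses Python's proleptic Gregorian `date` in place of the model calendar. The function name `boundary_dump_dates` is hypothetical:

```python
from datetime import date


def boundary_dump_dates(start_date, end_date, dumpfreq):
    """List first-of-month ('m') or first-of-year ('y') dates after
    start_date and up to and including end_date."""
    if dumpfreq not in ("m", "y"):
        raise ValueError(f"Unsupported dumpfreq={dumpfreq!r}")
    dates = []
    year, month = start_date.year, start_date.month
    while True:
        if dumpfreq == "m" and month < 12:
            month += 1
        else:
            # Either a yearly dump, or a monthly dump rolling over December.
            year, month = year + 1, 1
        boundary = date(year, month, 1)
        if boundary > end_date:
            return dates
        dates.append(boundary)
```

For a one-year run starting on 0101-07-01 with dumpfreq=y, this gives a single dump on 0102-01-01, matching the iced.01020101 file above. For a one-month run starting on 0101-01-01 with dumpfreq=y it gives no dump dates at all, which is the situation that led to the stale restart in the original report.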

Given the time constraints (with @aidanheerdegen finishing up for the year at the end of this week), what would you @anton-seaice and @aidanheerdegen think of deferring this additional check to next year, and including just the following checks in the current release:

  1. The iced.YYYYMMDD file in the restart pointer matches the experiment start date.
  2. The time in the binary restart header matches the time since init_date calculated by payu.

I have a working version of the above updates with unit tests in #539, which would be ready for review if we're happy to delay the dumpfreq checks. Let me know what you prefer!

@anton-seaice
Contributor

Yes, that sounds good. I'll review soon :)

@blimlim
Contributor Author

blimlim commented Dec 11, 2024

Awesome, thanks @anton-seaice! I'll make a separate issue for adding the dumpfreq checks


3 participants