Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for date-based restart frequency #363

Merged
merged 5 commits into from
Oct 12, 2023

Conversation

jo-basevi
Copy link
Collaborator

Add support for date-based restart pruning in Payu - should close #358.

As suggested by @aidanheerdegen here, an option is to support a subset of pandas date offset frequency aliases:

Unit Description
MS Month start frequency
YS Year start frequency
M Month frequency
Y Year frequency
W Weekly frequency
D Day frequency
H Hour frequency
T Minute frequency
S Second freqeuncy

So for example, if in config.yaml,

restart_freq: 5YS

Then the earliest restart of the year, every 5 years from the first restart will be retained.

During archive, the function Experiment.prune_restarts() would inspect each restart directory and return a list of restart directories that can removed using integer or date-based restart frequency. For date-based frequency, individual restart directory paths are passed to the model specific driver that can parse the restart files and return a cftime (or standard) datetime object.

In this branch, access-om2 has the model.get_restart_datetime() function implemented, which calls the mom's get_restart_datetime() which parses the ocean_solo.res file for the final datetime of the restart. I used this file as it is what is used in COSIMA's tidy_restarts script and it also contains information on what calendar it is using.

Currently the month (M) and year (Y) frequencies are defined similar to adding relativedelta's month and years. So for example if a datetime was 15/01/2000, after adding an offset of '6M' it would be 15/07/2000. I have only just noticed that 'M' in pandas documentation is actually the end of the month.. Would the end of the month/year be more useful? Also, with the different cftime calendars M and Y could give unexpected results i.e with different day months. For example, I've got some logic for cftime's360_day, noleap and all_leap calendars but nothing yet for the julian calendar. Also would the start of month/year (MS/YS) frequencies be sufficient- Is M and Y frequencies even needed?

Week, day, hour, minute frequencies may not be super useful or even necessary but they were easy to add as adding timedeltas is supported with both cftime and standard datetime calendars.

The first restart datetime is used as a point of reference for adding the first frequency interval. Then the next restart with a datetime at or after this checkpoint is kept and then becomes the reference when adding the next time interval. I think using the most recent "kept" datetime as a reference for the next checkpoint could be better than strict regular intervals from the earliest datetime. As the first restart could eventually be lost due to scratch timeouts and using "YS"/"MS" would still keep the earliest restarts in each year/month.

As one of the motivations for date-based restart frequencies was to make syncing restarts to a remote archive easier as payu would know what restarts could be stored permanently. A way to check if a restart would eventually be removed would be to see if the restart directory was in the list returned by prune_restarts(from_n_restart=restart_history - 1, to_n_restart:self.counter) where restart_history is the config.yaml or default value of how many latest restarts are kept.

Any feedback, questions or suggestions is much appreciated!

@coveralls
Copy link

coveralls commented Sep 15, 2023

Coverage Status

coverage: 45.573% (+2.8%) from 42.772% when pulling bc0aa45 on jo-basevi:358-date-based-frequency into cfd694a on payu-org:master.

@jo-basevi jo-basevi marked this pull request as ready for review September 17, 2023 23:15
Copy link
Collaborator

@aidanheerdegen aidanheerdegen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Thanks for the excellent work.

Sorry it has taken so long to finish this review.

I've still not tried it out with a model, which I would like to do, but I didn't want to hold you up and figured this review was long enough as it was.

payu/calendar.py Outdated Show resolved Hide resolved
test/test_calendar.py Outdated Show resolved Hide resolved
payu/experiment.py Outdated Show resolved Hide resolved
payu/experiment.py Outdated Show resolved Hide resolved
payu/models/accessom2.py Outdated Show resolved Hide resolved
payu/calendar.py Outdated Show resolved Hide resolved
payu/calendar.py Outdated Show resolved Hide resolved
payu/calendar.py Show resolved Hide resolved
payu/experiment.py Outdated Show resolved Hide resolved
payu/models/mom.py Outdated Show resolved Hide resolved
@pep8speaks
Copy link

pep8speaks commented Sep 22, 2023

Hello @jo-basevi! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 209:20: W291 trailing whitespace

Comment last updated at 2023-10-11 00:08:23 UTC

@jo-basevi jo-basevi marked this pull request as draft September 25, 2023 22:56
@jo-basevi jo-basevi force-pushed the 358-date-based-frequency branch from 874b99a to 27fa74d Compare September 27, 2023 05:32
@jo-basevi
Copy link
Collaborator Author

Main changes were removing M and Y offset options which simplified the code quite a bit
as all cftime calendars are supported with YS - Year start and MS - Month start (i.e don't have to worry about the end of month logic..).

Other changes were rewriting the calendar tests and added integration-like tests for parsing MOM restart files and the prune_restarts() method in the Experiment class.

Other Notes
I removed this from the documentation:
" Intermediate restarts are not deleted until a permanently archived restart
has been produced. For example, if we have just completed run 11, then
we keep restart004, restart009, restart010, and restart011.
Restarts 10 through 13 are not deleted until restart014 has been saved."

Because the logic previously was to not delete the last restart_history number of restarts. Intermediate restarts were not deleted until a permanently saved restart was only automatically the case when restart_freq was 5 or less as the default restart_history was also 5. Or if restart_history was also configured to be greater than or equal to restart_freq. I could change it so restart_history is set to be minimum of restart_freq to guarantee immediate restarts aren't pruned too early for integer restart frequency and add it back into the documentation?

Other things I got caught out on was the difference between 12MS and 1YS, as the logic uses the runtime of earliest restart as a reference (not the model start time). Say for example, there were runs with runtime of 1 month, with a mom model start time of 1900/01/01:
(run_date, restart_directory)
(1900/02/01, restart000)
(1900/03/01, restart001)
(1900/04/01, restart002)
...

If using frequency 12MS, the restarts that would be saved are:
(1900/02/01, restart000)
(1901/02/01, restart012)
(1901/02/01, restart024)
...

While 1YS would keep
(1900/02/01, restart000)
(1901/01/01, restart011)
(1901/01/01, restart023)
...

The other thing to keep in mind is that the code currently uses the next kept restart as a reference for the next time interval. This means that restarts saved may not be uniformly a fixed frequency from the first restart if there's irregular time steps.

For restart_freq: 6MS, theres a minimum 6 months between each saved restart. So with using the restart's list from above, the restarts saved would be:
(1900/02/01, restart000)
(1900/09/01, restart002) (earliest restart after 6 months)
(1901/03/01, restart004) (6 months after previous kept restart)

If instead there was a fixed frequency from the first restart, the restarts kept would be:
(1900/02/01, restart000)
(1900/09/01, restart002)
(1901/02/01, restart003)
If restart000 was lost somehow then the logic would break as restart002 would be the new reference and restart003 would be purged in a subsequent run through..

@jo-basevi jo-basevi marked this pull request as ready for review September 27, 2023 06:03
@aidanheerdegen
Copy link
Collaborator

Other Notes I removed this from the documentation: " Intermediate restarts are not deleted until a permanently archived restart has been produced. For example, if we have just completed run 11, then we keep restart004, restart009, restart010, and restart011. Restarts 10 through 13 are not deleted until restart014 has been saved."

Because the logic previously was to not delete the last restart_history number of restarts. Intermediate restarts were not deleted until a permanently saved restart was only automatically the case when restart_freq was 5 or less as the default restart_history was also 5. Or if restart_history was also configured to be greater than or equal to restart_freq. I could change it so restart_history is set to be minimum of restart_freq to guarantee immediate restarts aren't pruned too early for integer restart frequency and add it back into the documentation?

Good catch. Yes i think restart_history used should be min(restart_history, restart_freq). It doesn't make sense to keep five previous restarts if the restart_freq is 2. So that would mean reinstating that bit in the docs, though it does need editing to ay it uses restart_history and keeps min(restart_history, restart_freq) of the most recent history files.

How does this affect date based pruning though? We can keep restart_history as an integer (just keep the n most recent restarts) but then what does min(restart_history, restart_freq)mean? Would you translaterestart_freq` into an integer number of runs? Not sure how they could be determined.

Other things I got caught out on was the difference between 12MS and 1YS

So this is just something that needs to be kept in mind by the user?

The other thing to keep in mind is that the code currently uses the next kept restart as a reference for the next time interval. This means that restarts saved may not be uniformly a fixed frequency from the first restart if there's irregular time steps.

For restart_freq: 6MS, theres a minimum 6 months between each saved restart. So with using the restart's list from above, the restarts saved would be: (1900/02/01, restart000) (1900/09/01, restart002) (earliest restart after 6 months) (1901/03/01, restart004) (6 months after previous kept restart)

The restarts from above were monthly, so I'm confused by (1900/02/01, restart000) (1900/09/01, restart002) shouldn't it be (1900/02/01, restart000) (1900/09/01, restart007)?

If instead there was a fixed frequency from the first restart, the restarts kept would be: (1900/02/01, restart000) (1900/09/01, restart002) (1901/02/01, restart003)

Which is the same as above .. no?

If restart000 was lost somehow then the logic would break as restart002 would be the new reference and restart003 would be purged in a subsequent run through..

I'm still not sure those restart numbers are correct, or if they are I understand why, so it's a bit hard to make an informed comment on this bit.

Just some general questions:

  1. If a user would like to always save restarts at the beginning of a year (or as close as possible) is that possible?
  2. Do we need to save some state about previous inferred restart frequencies to be able to warn users if they make a change that could be deleterious?

@jo-basevi
Copy link
Collaborator Author

I'm still not sure those restart numbers are correct, or if they are I understand why, so it's a bit hard to make an informed comment on this bit.

Sorry that's my bad!! I deleted the list of restarts I was talking about. So here is the list of restarts I was referring to:

(1900/02/01, restart000)
(1900/07/01, restart001)
(1900/09/01, restart002)
(1901/02/01, restart003)
(1901/03/01, restart004)
...

@jo-basevi
Copy link
Collaborator Author

If a user would like to always save restarts at the beginning of a year (or as close as possible) is that possible?

Yes, using 1YS would save the restart where current model run time is the earliest of each year.

Do we need to save some state about previous inferred restart frequencies to be able to warn users if they make a change that could be deleterious?

Would that require saving the previous restart frequency as an enviroment variable? Changing to a longer
date-based period or a shorter integer frequency would not result in deletions. Though it would be tricky comparing an integer frequency with a datebased frequency..

As archive() runs after each experiment run, if there is greater than 1 restart to prune, that will indicate that either:

  • restart_frequency is set to a shorter date-based or higher integer value
  • or that restart_history is changed to a lower value.

Instead of saving previous state, during setup and if archive is enabled, I could inspect what restarts it would prune (but not actually delete them..). If restart_freq and restart_history are the same, it should be an empty list at that stage. If not, emit a warning that the config has changed, and more restarts will be deleted at the next time archive is run.

How does that sound? Running that check in setup would also have the benefit for any warnings regarding invalid restart_freq values.

@jo-basevi
Copy link
Collaborator Author

Good catch. Yes i think restart_history used should be min(restart_history, restart_freq). It doesn't make sense to keep five previous restarts if the restart_freq is 2. So that would mean reinstating that bit in the docs, though it does need editing to ay it uses restart_history and keeps min(restart_history, restart_freq) of the most recent history files.
How does this affect date based pruning though? We can keep restart_history as an integer (just keep the n most recent restarts) but then what does min(restart_history, restart_freq)mean? Would you translaterestart_freq` into an integer number of runs? Not sure how they could be determined.

Ok, an idea for an simple algorithm pattern to keep intermediate restarts is:

restarts_to_prune = []
intermediate_restarts = []
for restart in restarts:
    if restart_should_be_pruned():
        # keep track of intermediate restarts 
        intermediate_restarts.append(intermediate_restarts)
    
    else:  # restart should be saved
        # add prior intermediate restarts to the list of restarts_to_prune
        restarts_to_prune.extend(intermediate_restarts)
        intermediate_restarts = []

This will work for both integer and date_based frequency and wouldn't require a restart_history config variable at all..
If restart_history is useful to keep, it could act as an override option to ensure that the last n restarts are always kept?

@aidanheerdegen
Copy link
Collaborator

How does that sound? Running that check in setup would also have the benefit for any warnings regarding invalid restart_freq values.

That sounds like a great idea, thanks. I definitely think there is potential for Bad ThingsTM to happen, and checks and emitting information to the user is a good idea.

I definitely can see a use-case for someone to decide to change their restart_frequency when their experiment progresses. This could be to reduce disk usage or just because they realise they don't need so many.

But I think we want to make this sort of mass-deletion an interactive only option. So warn that lots of restarts would be the result of the next archive, so make them use a -f/--force flag to make it happen, otherwise no restarts are pruned?

Specifying flags isn't something you can do as part of an automated run, so by default it has to be done interactively.

Jo Basevi and others added 3 commits October 6, 2023 11:05
- Move logic for deciding what restarts to prune to Experiment.prune_restarts()
- Extend access-om2 and mom driver to parse restart files for a cftime datetime
- Add functionality to calendar.py to parse date-based frequency values and add these intervals to different datetimes
- remove logic for Y,M frequency offsets as YS/MS may be sufficient in this case
- rewrite/parametrize calendar tests
- add tests for mom-driver parsing restart files
- add integration tests for prune_restarts and access-om2 model driver
- add documentation on date-based restart freq and restart_history
- refactor make/list/del restart dir test code to common
… pruning

- Moved restart parsing code in Mom driver to Fms so it can also work with Mom6
- Add checks to setup for invalid restart pruning configuration or warn if there's changes
- add --force-prune-restarts flag to run and archive cmds
- add restart_history as an override to how many latest restarts to keep
@jo-basevi jo-basevi force-pushed the 358-date-based-frequency branch from 25b12e8 to 56f45da Compare October 6, 2023 00:13
@jo-basevi
Copy link
Collaborator Author

New Changes

Support for both mom5 and mom6 models

  • Move mom driver restart parsing method to fms driver so it can work with mom6 and mom models (as both have the ocean_solo.res file). The other fms model subclass is GOLD. I’m not too sure whether it has similar file or not.. So the method checks first that the file exists and raises a descriptive error if not.

Intermediate restarts and restart_hisory

  • Intermediate restarts can be kept with both integer and date-based restart pruning
  • However, if restart_history is defined, it works as it did previously and only keeps restart_history number of restarts. This would be useful for only keeping more or less restarts than intermediate restarts.

Force restart pruning

  • At the end of setup, a scan of what restarts would be pruned is run. This will error out if there's any invalid configuration (i.e. unsupported values, or if restart files can't be parsed). It will also warn if more restarts than expected would be pruned, and whether --force-prune-restarts needs to be added to next run/archive. This would be the case if restart_history or restart_freq has been changed to values which would prune more restarts, or if auto archiving is suddenly enabled.
  • Added --force-prune-restarts flag to allow greater deletions of restarts, otherwise no restarts will be deleted. This can be run as payu run --force-prune-restarts or payu archive --force-prune-restarts - if archive is being run manually.

aidanheerdegen
aidanheerdegen previously approved these changes Oct 10, 2023
Copy link
Collaborator

@aidanheerdegen aidanheerdegen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for all the work.

A couple of comments, maybe a change, but otherwise good to go I think.

docs/source/config.rst Outdated Show resolved Hide resolved
payu/experiment.py Outdated Show resolved Hide resolved
payu/experiment.py Outdated Show resolved Hide resolved
payu/experiment.py Show resolved Hide resolved
test/test_prune_restarts.py Show resolved Hide resolved
test/test_prune_restarts.py Outdated Show resolved Hide resolved
@aidanheerdegen
Copy link
Collaborator

Move mom driver restart parsing method to fms driver so it can work with mom6 and mom models (as both have the ocean_solo.res file). The other fms model subclass is GOLD. I’m not too sure whether it has similar file or not.. So the method checks first that the file exists and raises a descriptive error if not.

I was wrong when I said this was FMS based, I mean, they're all FMS based, and so end up using the same convention, but the naming of the file is in the model driver code, not FMS. Regardless I think it is the right decision to put the code in the FMS driver for the moment.

I guess longer term the best option might be to put it in a mixin class and inherit that from the models that use ocean_solo.res.

- Allow regex to support restart/output dirs with counter > 999
- Add test cases for restart_history smaller/greater than  restart_freq
@jo-basevi jo-basevi force-pushed the 358-date-based-frequency branch from e751869 to bc0aa45 Compare October 11, 2023 00:08
@jo-basevi
Copy link
Collaborator Author

Thanks for the review @aidanheerdegen! I've added in the suggested changes and the small fix to counter > 999. Should close #372

Copy link
Collaborator

@aidanheerdegen aidanheerdegen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@aidanheerdegen
Copy link
Collaborator

I've replied where appropriate and resolved all the conversations, so good to merge.

@jo-basevi jo-basevi merged commit 6cda0d5 into payu-org:master Oct 12, 2023
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support date based restart_freq
4 participants