Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor data step, inference step, Jupyter notebooks #97

Merged
merged 114 commits into from
Dec 11, 2023
Merged

Conversation

raehik
Copy link
Contributor

@raehik raehik commented Oct 17, 2023

This PR covers multiple "feature" changes.

  • refactor data step: replace CLI, rewrite some library internals, add more documentation
  • refactor train step: replace CLI, refactor & document various data processes
  • add new infer step: load prepped data (low-res ocean velocities) and pre-trained model, predict forcings using model
  • update Jupyter notebooks to use new interfaces (code de-duplication, documenting)

Extra:

  • document parts of the model, training (mirroring some explanations from the paper)

Important to-dos:

  • re-add "testing" script cli/testing.py (needs better name! check?)
    • currently broken due to MLflow param loading. needed to reproduce paper figures, where the 5% of input data we skip during training is predicted on. you can still produce those figures using predictions from cli/infer.py, but you would have to split the data again yourself for the same behaviour. we could rewrite the dataset splitting used in cli/train.py in some way that it is easy to obtain the correct data to use in cli/infer.py.
    • also does fine-tuning. in previous discussions Arthur was OK to remove this, which was convenient for removing extracting MLflow dependency. if we re-added fine-tuning to cli/infer.py too, we could remove cli/testing.py
  • discuss making predictions (using pretrained model)
    • somewhat avoided this-- pointed to paper for discussion on model usage
  • general code & user documentation on forcing dataset shape (co-ord, variable, dimension names)
  • Document how to use new inference mode: download pre-trained model from HuggingFace, load in dataset, run. Link from readme
  • finish last checked-over Jupyter notebook (test_global_control)
    • This notebook needs two datasets:
      • Computed forcings from the CM2.6 dataset which are then coarsened
      • Output of inference (testing) step
    • These are then compared
    • [ ] Provide option to use the ML flow style, or to load from file (e.g., xarray.open_zarr on the forcings and output of inference step).

Closes #87 , #90 , #98 , #4 .

Known bugs:

@raehik raehik force-pushed the data-step-refactor branch from b483fd0 to 87a0cae Compare October 17, 2023 11:41
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@raehik raehik force-pushed the data-step-refactor branch 3 times, most recently from d5755c3 to 60813a6 Compare October 18, 2023 14:29
@raehik raehik changed the title Data step refactor Refactor data step, inference step, Jupyter notebooks Oct 30, 2023
@raehik raehik mentioned this pull request Nov 8, 2023
@raehik raehik force-pushed the data-step-refactor branch 2 times, most recently from 0c93ea2 to ebfcd52 Compare November 9, 2023 13:59
@raehik raehik force-pushed the data-step-refactor branch from 1a5daf0 to 998cf51 Compare November 30, 2023 12:02
@dorchard
Copy link
Collaborator

dorchard commented Nov 30, 2023

Can I request that the module in the cli directory have their _cli_desc put as the first definition after imports with a comment that says # Description of this module or something as they provide a good explanation of what the file is about.

@dorchard
Copy link
Collaborator

The section marked ## Data on HuggingFace probably needs to say something about what kind of data and trained model (i.e., that the moment it is low res) and we can update this later if we can get the high-res.

@raehik raehik linked an issue Dec 5, 2023 that may be closed by this pull request
@raehik raehik mentioned this pull request Dec 6, 2023
@raehik raehik linked an issue Dec 6, 2023 that may be closed by this pull request
Seems that Dask needs the explicit map_blocks to schedule properly.
Otherwise memory usage balloons.
@raehik raehik marked this pull request as ready for review December 6, 2023 16:05
Copy link
Collaborator

@MarionBWeinzierl MarionBWeinzierl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed the changes previously, just a couple of tidy-up comments/questions below.

docs/2021-paper-reproduction.md Outdated Show resolved Hide resolved
src/gz21_ocean_momentum/train/base.py Outdated Show resolved Hide resolved
src/gz21_ocean_momentum/lib/data.py Outdated Show resolved Hide resolved
src/gz21_ocean_momentum/lib/data.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@MarionBWeinzierl MarionBWeinzierl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ok to be merged in

Commented out since first commit.
Uses the wrong variable for temperature in CM2.6
(surface_temp, not surface_temperature).
@raehik raehik merged commit 414aa8e into main Dec 11, 2023
4 of 6 checks passed
@raehik raehik deleted the data-step-refactor branch December 19, 2023 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants