Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor data step script into library (API) and consumer (CLI) #85

Merged
merged 24 commits into from
Sep 22, 2023

Conversation

raehik
Copy link
Contributor

@raehik raehik commented Aug 23, 2023

There are some pain points with the current data step.

  • user is not able to select output path
    • MLflow places artifacts in the working directory, under mlruns. It uses 2 long random strings.
  • the mlflow run CLI is clunky
    • Appears restrictive -- no mutually-exclusive options?
    • CLI is partially defined with argparse in cmip26.py, partially with MLflow (via MLproject, which gets used by mlflow run) i.e. some positional arguments are upgraded to (required) options in MLflow
  • the top-level data step is defined in a Python script, cmip26.py
    • It does clean module calling inside, but as is it's not ready to be packaged up.

This PR largely rewrites the data step. Unused code is removed. Stateful operations (globals) are moved into functions. The top-level script is now just a CLI and a handful of operations, mirroring how one would use it directly in Python.

  • CLI is cleaner
    • You may also pass a YAML file with the CLI options in instead. (Makes sharing configurations much easier.)
  • Internals are clearer and safer, using Python typing stuff
    • e.g. BoundingBox, CO2 increase handling
  • Whole step functionalized: Python interface is clear, though not explicitly documented

Some of the training step is touched too. Larger refactoring will be in another changeset.

Not done:

  • Loading and processing of dataset is somewhat general, but various internals still expect CM2.6 (and various CM2.6 coordinates/data variables).
  • Jupyter notebooks are not updated. MLflow running will not work properly -- they should be replaced by explicit python calls and explicit data locations instead of run IDs.

To-dos:

  • Move from new.
  • Move training step refactoring
  • Tweak training step subdomain loading
  • Re-add prints, progress bars
  • How to make --co2-increase flag work in MLproject
    • Appears to be a limitation -- rewritten readme to use direct invocation example
  • Update CLI invocations in documentation

Related work to do post-merge:

  • Update Jupyter notebooks

@raehik raehik mentioned this pull request Aug 23, 2023
@raehik raehik force-pushed the data-step-refactor branch from ab81ea3 to 556b77c Compare August 23, 2023 15:13
@raehik raehik changed the title refactor data step refactor data step (CLI/top-level) Aug 23, 2023
@raehik
Copy link
Contributor Author

raehik commented Aug 25, 2023

I think the data step is ready, just needs some touching up before review. I'm adding some work on the training step here too, I'll move it out before review.

@raehik raehik self-assigned this Aug 29, 2023
@raehik raehik force-pushed the data-step-refactor branch 2 times, most recently from 2d8d91c to 34f457f Compare August 31, 2023 14:15
@raehik
Copy link
Contributor Author

raehik commented Aug 31, 2023

I can't seem to get the MLflow interface working nicely with the simplified CLI. By simplified, I mean --global_ {0,1}, --co2 {0,1} being replaced with --cyclize, --co2-increase. But that type of no-value option aren't supported by MLproject. I can't tell why, it seems like a very simple feature.

@raehik raehik marked this pull request as ready for review September 1, 2023 15:08
@raehik
Copy link
Contributor Author

raehik commented Sep 1, 2023

On testing, this produces forcing data ~x4 larger than currently. Not sure what sort of errors would result in that, but I can go through the changes again. Lines that touch gaussian_filter and further up the call chain seem most likely.

@raehik
Copy link
Contributor Author

raehik commented Sep 11, 2023

Likely candidates:

  • eddy_forcing was misused: both forcing_coarse and the edited u_v_dataset were returned as a tuple, but the function signature stated it returned a single dataset, and it was used as such. Maybe my simplifying changed behaviour here...?
  • scipy.ndimage.gaussian_filter was used weirdly, more erroneous type annotations. Probably fine, but needed some inspection.
  • ...a lib call had the grid and velocities dataset args the wrong way round...

@raehik
Copy link
Contributor Author

raehik commented Sep 12, 2023

No, I misread some clauses, like this early return (debug_mode is unused):

if not debug_mode:
return forcing_coarse
u_v_dataset = u_v_dataset.merge(adv)

@raehik
Copy link
Contributor Author

raehik commented Sep 13, 2023

There were many small mistakes! I'm now getting identical outputs to main for the same configuration. Need to clean up the history and rejig some code I re-messied.

@raehik raehik changed the title refactor data step (CLI/top-level) refactor data step script into library (API) and consumer (CLI) Sep 20, 2023
@raehik
Copy link
Contributor Author

raehik commented Sep 20, 2023

Cleaned up history and logging/debugging setup, sorted all the to-dos I can (prior-existing ones that I'm unsure how to resolve are annotated and left). Ready for review.

Copy link
Collaborator

@MarionBWeinzierl MarionBWeinzierl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refactored code looks much clearer. There are a couple of changes in the readme, and possibly missed pushes/updates which need fixing. Also, we need to retain the mlflow commands in the Readme, so that the instructions are coherent with the training and inference steps, and all the steps can be run.

README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
u_v_dataset = u_v_dataset.fillna(0.0)

# Interpolate temperature
# interp_coords = dict(xt_ocean=u_v_dataset.coords['xu_ocean'],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we deleted the commented out code?

@@ -39,48 +185,22 @@ def advections(u_v_field: xr.Dataset, grid_data: xr.Dataset):
adv_x = u * gradient_x["usurf"] + v * gradient_y["usurf"]
adv_y = u * gradient_x["vsurf"] + v * gradient_y["vsurf"]
result = xr.Dataset({"adv_x": adv_x, "adv_y": adv_y})
# TODO check if we can simply prevent the previous operation from adding
# chunks
# TODO 2023-09-20: old note from original import: v
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove Todo and make Github issue? @arthurBarthe , does this comment say anything to anyone anymore?


```
mlflow run . --experiment-name <name>--env-manager=local \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep the mlflow run instruction, as otherwise the subsequent steps (training, inference), which rely on the experiment and run ids, do not work.

README.md Outdated Show resolved Hide resolved
Rewrite as a library (set of functions) and a CLI.
Cleaner subdomain configuration.
@raehik raehik merged commit b483fd0 into dev Sep 22, 2023
0 of 6 checks passed
@raehik
Copy link
Contributor Author

raehik commented Sep 22, 2023

yoooo it automatically merged? I had no idea that would happen. I rebased dev onto data-step-refactor locally and pushed, and that's been processed as a merge on GitHub!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants