refactor data step script into library (API) and consumer (CLI) #85

raehik · 2023-08-23T15:11:09Z

There are some pain points with the current data step.

- The user is not able to select the output path.
  - MLflow places artifacts in the working directory, under mlruns, in a path that uses two long random strings.
- The mlflow run CLI is clunky.
  - Appears restrictive -- no mutually-exclusive options?
  - The CLI is defined partly with argparse in cmip26.py and partly with MLflow (via MLproject, which gets used by mlflow run), i.e. some positional arguments are upgraded to (required) options in MLflow.
- The top-level data step is defined in a Python script, cmip26.py.
  - It does clean module calling inside, but as is it's not ready to be packaged up.

This PR largely rewrites the data step. Unused code is removed. Stateful operations (globals) are moved into functions. The top-level script is now just a CLI and a handful of operations, mirroring how one would use it directly in Python.

- The CLI is cleaner.
  - You may also pass a YAML file containing the CLI options instead. (Makes sharing configurations much easier.)
- Internals are clearer and safer, using Python typing.
  - e.g. BoundingBox, CO2 increase handling
- The whole step is functionalized: the Python interface is clear, though not explicitly documented.
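As an illustration of the "clearer and safer internals" point, here is a minimal sketch of what a typed bounding box with a validation step could look like. This is not the code from this PR: the field names, the `parse_bounding_box` helper and the comma-separated input format are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Latitude/longitude rectangle in degrees (hypothetical field names)."""

    lat_min: float
    lat_max: float
    long_min: float
    long_max: float

    def validate(self) -> None:
        """Raise ValueError if the box is empty, inverted, or out of range."""
        if not -90.0 <= self.lat_min < self.lat_max <= 90.0:
            raise ValueError(
                f"invalid latitude range: {self.lat_min}..{self.lat_max}"
            )
        if not self.long_min < self.long_max:
            raise ValueError(
                f"invalid longitude range: {self.long_min}..{self.long_max}"
            )


def parse_bounding_box(text: str) -> BoundingBox:
    """Parse 'lat_min,lat_max,long_min,long_max' (assumed format) and validate it."""
    parts = [float(p) for p in text.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated numbers, got {len(parts)}")
    box = BoundingBox(*parts)
    box.validate()
    return box


box = parse_bounding_box("-25,25,-280,80")
print(box)
```

Carrying one validated value around, rather than four loose floats, means a swapped or inverted bound fails early at the CLI boundary instead of deep inside the data processing.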

Some of the training step is touched too. Larger refactoring will be in another changeset.

Not done:

- Loading and processing of the dataset is somewhat general, but various internals still expect CM2.6 (and various CM2.6 coordinates/data variables).
- Jupyter notebooks are not updated. MLflow running will not work properly -- the notebooks' MLflow calls should be replaced by explicit Python calls and explicit data locations instead of run IDs.

To-dos:

- Move from new.
- Move training step refactoring
- Tweak training step subdomain loading
- Re-add prints, progress bars
- How to make --co2-increase flag work in MLproject
  - Appears to be a limitation -- rewritten readme to use direct invocation example
- Update CLI invocations in documentation

Related work to do post-merge:

- Update Jupyter notebooks

raehik · 2023-08-25T10:39:54Z

I think the data step is ready, just needs some touching up before review. I'm adding some work on the training step here too, I'll move it out before review.

raehik · 2023-08-31T15:26:12Z

I can't seem to get the MLflow interface working nicely with the simplified CLI. By simplified, I mean --global_ {0,1} and --co2 {0,1} being replaced with --cyclize and --co2-increase. But that type of no-value option isn't supported by MLproject. I can't tell why; it seems like a very simple feature.
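For reference, the no-value options described here map onto argparse's `store_true` action. The sketch below uses the two flag names from this comment; the parser description and help strings are placeholders, and the real CLI may be set up differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with presence-only flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="data step CLI (sketch)")
    # Passing the flag sets the value to True; omitting it leaves it False.
    # No explicit 0/1 value is needed on the command line.
    parser.add_argument(
        "--cyclize",
        action="store_true",
        help="replaces the old --global_ {0,1} option",
    )
    parser.add_argument(
        "--co2-increase",
        action="store_true",
        help="replaces the old --co2 {0,1} option",
    )
    return parser


# argparse turns the dash in --co2-increase into an underscore in the attribute name.
args = build_parser().parse_args(["--co2-increase"])
print(args.cyclize, args.co2_increase)  # prints: False True
```

Because a flag like this carries no value, it has no counterpart for a parameter mechanism that always substitutes a value into a command template, which is consistent with the limitation described above.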

raehik · 2023-09-01T16:07:34Z

On testing, this produces forcing data roughly 4x larger than the current code does. Not sure what sort of errors would result in that, but I can go through the changes again. Lines that touch gaussian_filter and further up the call chain seem most likely.

raehik · 2023-09-11T15:13:09Z

Likely candidates:

- eddy_forcing was misused: both forcing_coarse and the edited u_v_dataset were returned as a tuple, but the function signature stated it returned a single dataset, and it was used as such. Maybe my simplifying changed behaviour here...?
- scipy.ndimage.gaussian_filter was used weirdly, with more erroneous type annotations. Probably fine, but needed some inspection.
- ...a lib call had the grid and velocities dataset args the wrong way round...
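One way to guard against the swapped-argument kind of mistake is to make same-typed parameters keyword-only. The sketch below is hypothetical: `compute_forcing`, the `Dataset` alias (a plain dict standing in for `xr.Dataset`) and the `dxu` grid variable are invented for the example; only the `usurf`/`vsurf` names come from the code under review.

```python
from typing import Any

# Stand-in for xr.Dataset so the sketch runs without xarray (assumption).
Dataset = dict[str, Any]


def compute_forcing(*, velocities: Dataset, grid: Dataset, scale: int) -> Dataset:
    """Hypothetical forcing entry point with keyword-only arguments.

    The bare * forces callers to name each argument, so two datasets of the
    same type cannot be swapped silently by position.
    """
    missing = {"usurf", "vsurf"} - velocities.keys()
    if missing:
        raise KeyError(f"velocities dataset is missing {sorted(missing)}")
    # Placeholder result: the real computation is out of scope for this sketch.
    return {
        "scale": scale,
        "velocity_variables": sorted(velocities),
        "grid_variables": sorted(grid),
    }


result = compute_forcing(
    velocities={"usurf": 0.1, "vsurf": 0.2},
    grid={"dxu": 1.0},
    scale=4,
)
print(result)
```

A positional call such as `compute_forcing(grid_ds, velocities_ds, 4)` raises `TypeError` at the call site, and a named call with the datasets swapped fails on the missing-variable check rather than producing wrong numbers.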

raehik · 2023-09-12T14:30:19Z

No, I misread some clauses, like this early return (debug_mode is unused):

gz21_ocean_momentum/src/gz21_ocean_momentum/data/coarse.py

Lines 192 to 194 in fff986c

    
    if not debug_mode:
        return forcing_coarse
    u_v_dataset = u_v_dataset.merge(adv)

raehik · 2023-09-13T15:05:53Z

There were many small mistakes! I'm now getting identical outputs to main for the same configuration. Need to clean up the history and rejig some code I re-messied.

raehik · 2023-09-20T13:50:59Z

Cleaned up history and logging/debugging setup, sorted all the to-dos I can (prior-existing ones that I'm unsure how to resolve are annotated and left). Ready for review.

MarionBWeinzierl

The refactored code looks much clearer. There are a couple of changes in the readme, and possibly missed pushes/updates which need fixing. Also, we need to retain the mlflow commands in the Readme, so that the instructions are coherent with the training and inference steps, and all the steps can be run.

MarionBWeinzierl · 2023-09-20T14:25:51Z

src/gz21_ocean_momentum/step/data/lib.py

+        u_v_dataset = u_v_dataset.fillna(0.0)
+
+    # Interpolate temperature
+    # interp_coords = dict(xt_ocean=u_v_dataset.coords['xu_ocean'],


Can we delete the commented-out code?

MarionBWeinzierl · 2023-09-20T14:27:35Z

src/gz21_ocean_momentum/step/data/lib.py

@@ -39,48 +185,22 @@ def advections(u_v_field: xr.Dataset, grid_data: xr.Dataset):
    adv_x = u * gradient_x["usurf"] + v * gradient_y["usurf"]
    adv_y = u * gradient_x["vsurf"] + v * gradient_y["vsurf"]
    result = xr.Dataset({"adv_x": adv_x, "adv_y": adv_y})
-    # TODO check if we can simply prevent the previous operation from adding
-    # chunks
+    # TODO 2023-09-20: old note from original import: v


Remove the TODO and make a GitHub issue? @arthurBarthe, does this comment mean anything to anyone anymore?

MarionBWeinzierl · 2023-09-20T14:33:57Z

README.md


 ```
-mlflow run . --experiment-name <name>--env-manager=local \


Please keep the mlflow run instruction, as otherwise the subsequent steps (training, inference), which rely on the experiment and run ids, do not work.

raehik · 2023-09-22T15:44:48Z

yoooo it automatically merged? I had no idea that would happen. I rebased dev onto data-step-refactor locally and pushed, and that's been processed as a merge on GitHub!

raehik mentioned this pull request Aug 23, 2023

Refactor "step" scripts #83

Closed

raehik force-pushed the data-step-refactor branch from ab81ea3 to 556b77c Compare August 23, 2023 15:13

raehik changed the title ~~refactor data step~~ refactor data step (CLI/top-level) Aug 23, 2023

raehik mentioned this pull request Aug 25, 2023

Allow using local/cached CM2.6 dataset in data step #86

Open

raehik self-assigned this Aug 29, 2023

raehik force-pushed the data-step-refactor branch 2 times, most recently from 2d8d91c to 34f457f Compare August 31, 2023 14:15

raehik marked this pull request as ready for review September 1, 2023 15:08

raehik mentioned this pull request Sep 14, 2023

cmip26.py script name is mistaken (CM2.6, not CMIP) #87

Closed

mondus reviewed Sep 19, 2023

src/gz21_ocean_momentum/step/data/coarsen.py (outdated, resolved)

mondus reviewed Sep 19, 2023

src/gz21_ocean_momentum/step/data/coarsen.py (outdated, resolved)

raehik changed the title ~~refactor data step (CLI/top-level)~~ refactor data step script into library (API) and consumer (CLI) Sep 20, 2023

raehik force-pushed the data-step-refactor branch from 112462a to eafd869 Compare September 20, 2023 13:48

raehik requested a review from MarionBWeinzierl September 20, 2023 13:51

MarionBWeinzierl requested changes Sep 20, 2023

MarionBWeinzierl reviewed Sep 20, 2023

README.md (outdated, resolved)

This was referenced Sep 20, 2023

Catch subdomain configuration errors between training data generation and model training #77

Open

Restructure from scripts #8

Closed

raehik force-pushed the data-step-refactor branch from 6a9f2d2 to 0b28e2e Compare September 22, 2023 13:42

raehik changed the base branch from main to dev September 22, 2023 13:43

raehik added 3 commits September 22, 2023 14:49

Nix flake: use Python 3.11

de98d4a

data step: refactor

07afb00

Rewrite as a library (set of functions) and a CLI.

use new code paths in training step; fix MLproject

a073a38

Cleaner subdomain configuration.

raehik added 17 commits September 22, 2023 14:49

data: enable selecting Pangeo intake catalog

fb9179c

Also locks intake catalog to current HEAD.

MLproject: main->data, rename project

67e2adc

tweak MLproject, readme, data step CLI help

e0e2080

cli/data: re-add logging

bdb3b68

cli/data: +log bounding operation

6e317a0

step/data: simplify gaussian_filter call

eed04b3

No need to repeat sigma according to docs.

cli/data: fix forcing compute call arg order

65beb54

step/data: remove code copied from unused debug

95db10d

Also does more operations up front in the CLI for testing purposes.

step/data: fix coarsening scale args

1e36ce8

step/data/coarsen: cleaning

8ae5f67

step/data: simplify ufunc call

4bc7354

readme: clarify data step CLI

a08f621

step/data: clean up, log more

240cc6b

bounding box: add validate function

f1b47b8

step/data: clean up some to-dos

8b5aa4a

fix data step pytests

c74ea76

MLproject: remove data step

f5e8848

raehik force-pushed the data-step-refactor branch from 0b28e2e to f5e8848 Compare September 22, 2023 15:19

raehik added 4 commits September 22, 2023 16:28

readme: fix example commands

33681b7

readme: +note on --help for data step CLI

9be2292

cli/data: fix bound_dataset call

d89fe97

readme: +note on old GEOS req

b483fd0

raehik merged commit b483fd0 into dev Sep 22, 2023
0 of 6 checks passed

MarionBWeinzierl mentioned this pull request Sep 28, 2023

Remove MLFlow framework #4

Closed

raehik mentioned this pull request Oct 12, 2023

Training step refactor #95

Closed

6 tasks

raehik mentioned this pull request Oct 30, 2023

Refactor data step, inference step, Jupyter notebooks #97

Merged

5 tasks
