refactor data step script into library (API) and consumer (CLI) #85
ab81ea3 to 556b77c
I think the data step is ready; it just needs some touching up before review. I'm adding some work on the training step here too, which I'll move out before review.
2d8d91c to 34f457f
I can't seem to get the MLflow interface working nicely with the simplified CLI. By simplified, I mean
On testing, this produces forcing data ~4x larger than the current output. I'm not sure what sort of errors would result in that, but I can go through the changes again. Lines that touch
Likely candidates:
No, I misread some clauses, like this early return (gz21_ocean_momentum/src/gz21_ocean_momentum/data/coarse.py, lines 192 to 194 in fff986c)
There were many small mistakes! I'm now getting identical outputs to
112462a to eafd869
Cleaned up history and logging/debugging setup, sorted all the to-dos I can (prior-existing ones that I'm unsure how to resolve are annotated and left). Ready for review.
The refactored code looks much clearer. A couple of changes in the README, and possibly some missed pushes/updates, need fixing. Also, we need to retain the mlflow commands in the README, so that the instructions are coherent with the training and inference steps, and all the steps can be run.
u_v_dataset = u_v_dataset.fillna(0.0)

# Interpolate temperature
# interp_coords = dict(xt_ocean=u_v_dataset.coords['xu_ocean'],
Can we delete the commented-out code?
@@ -39,48 +185,22 @@ def advections(u_v_field: xr.Dataset, grid_data: xr.Dataset):
    adv_x = u * gradient_x["usurf"] + v * gradient_y["usurf"]
    adv_y = u * gradient_x["vsurf"] + v * gradient_y["vsurf"]
    result = xr.Dataset({"adv_x": adv_x, "adv_y": adv_y})
    # TODO check if we can simply prevent the previous operation from adding
    # chunks
    # TODO 2023-09-20: old note from original import: v
Remove the TODO and open a GitHub issue instead? @arthurBarthe, does this comment mean anything to anyone anymore?
mlflow run . --experiment-name <name> --env-manager=local \
Please keep the mlflow run instruction, as otherwise the subsequent steps (training, inference), which rely on the experiment and run ids, do not work.
6a9f2d2 to 0b28e2e
Rewrite as a library (set of functions) and a CLI.
Cleaner subdomain configuration.
Also locks intake catalog to current HEAD.
No need to repeat sigma according to docs.
Also does more operations up front in the CLI for testing purposes.
0b28e2e to f5e8848
yoooo it automatically merged? I had no idea that would happen. I rebased
There are some pain points with the current data step.
- `mlruns`. It uses 2 long random strings.
- `mlflow run` CLI is clunky
- `argparse` in `cmip26.py`, partially with MLflow (via `MLproject`, which gets used by `mlflow run`), i.e. some positional arguments are upgraded to (required) options in MLflow
- `cmip26.py`
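To illustrate the point about positional arguments being upgraded to required options, here is a minimal sketch using only the standard library `argparse`. The parameter name (`ntimes`) and the program name are made up for illustration; they are not taken from `cmip26.py` or the `MLproject` file.

```python
import argparse

# Hypothetical parser: the parameter is positional, as in a plain script.
positional = argparse.ArgumentParser(prog="data-step")
positional.add_argument("ntimes", type=int, help="number of time steps")

# Hypothetical parser: the same parameter as a required named option.
named = argparse.ArgumentParser(prog="data-step")
named.add_argument("--ntimes", type=int, required=True, help="number of time steps")

args_positional = positional.parse_args(["100"])
args_named = named.parse_args(["--ntimes", "100"])

# Both parsers fill the same attribute, but the command lines differ,
# so the two interfaces have to be documented and kept in sync separately.
print(args_positional.ntimes, args_named.ntimes)
```

When the same script is reachable through two such interfaces, every parameter change has to be made in both places.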
This PR largely rewrites the data step. Unused code is removed. Stateful operations (globals) are moved into functions. The top-level script is now just a CLI and a handful of operations, mirroring how one would use it directly in Python.
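The library/consumer split described above can be sketched roughly as follows. This is an illustration only: `coarsen`, `build_parser`, `main`, and the `--factor` option are hypothetical names, and the averaging is a placeholder operation rather than the project's actual data step.

```python
import argparse
from typing import List, Optional, Sequence


# Library side: pure functions, no module-level state, importable from Python.
def coarsen(values: Sequence[float], factor: int) -> List[float]:
    """Average consecutive groups of `factor` values (placeholder operation)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    usable = len(values) - len(values) % factor  # drop the incomplete tail
    return [sum(values[i:i + factor]) / factor for i in range(0, usable, factor)]


# Consumer side: a thin CLI that only parses arguments and calls the library.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical data step CLI")
    parser.add_argument("values", type=float, nargs="+", help="input values")
    parser.add_argument("--factor", type=int, default=2, help="coarsening factor")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> List[float]:
    args = build_parser().parse_args(argv)
    result = coarsen(args.values, args.factor)
    print(result)
    return result


# The CLI call and the direct library call go through the same function.
cli_result = main(["1", "2", "3", "4", "--factor", "2"])
lib_result = coarsen([1.0, 2.0, 3.0, 4.0], 2)
```

Because the CLI holds no logic of its own, tests can target the library functions directly without going through argument parsing.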
- `BoundingBox`, CO2 increase handling

Some of the training step is touched too. Larger refactoring will be in another changeset.
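As a rough idea of what a bounding-box type for subdomain configuration could look like, here is a hedged sketch. The field names, the latitude check, and the `contains` helper are assumptions made for illustration; the actual `BoundingBox` in this PR may be defined differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical lat/long box; the field names are assumptions."""

    lat_min: float
    lat_max: float
    long_min: float
    long_max: float

    def __post_init__(self) -> None:
        # Only latitude is validated; longitude wrap-around is left out here.
        if self.lat_min > self.lat_max:
            raise ValueError("lat_min must not exceed lat_max")

    def contains(self, lat: float, long: float) -> bool:
        """Return True if the point lies inside the box (bounds inclusive)."""
        return (
            self.lat_min <= lat <= self.lat_max
            and self.long_min <= long <= self.long_max
        )


box = BoundingBox(lat_min=35.0, lat_max=50.0, long_min=-50.0, long_max=-20.0)
print(box.contains(40.0, -30.0))
```

Grouping the four bounds into one immutable value lets a subdomain be passed around as a single argument instead of four loose floats.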
Not done:
- `python` calls and explicit data locations instead of run IDs.

To-dos:
- `new`.
- `--co2-increase` flag work in MLproject

Related work to do post-merge: