
Add documentation about how to reproduce the figures from the paper. #59

Closed
dorchard opened this issue Jun 28, 2023 · 26 comments · Fixed by #82

@dorchard (Collaborator)

Various scripts generate the figures. We should provide an explanation of how to regenerate the key figures from the paper, referring to them by figure name. This could be separated out into a separate script or notebook, as long as it's clear from the README where to go to find this.

dorchard changed the title from "Add documentation to the README about how to reproduce the figures from the paper." to "Add documentation about how to reproduce the figures from the paper." on Jun 28, 2023
@MarionBWeinzierl (Collaborator)

MarionBWeinzierl commented Aug 14, 2023

@arthurBarthe , can you please identify which of the notebooks can be deleted and which are needed to reproduce the paper, and also how to get the data/setup to use them?

For example, there is mention of a load_paper_net() function, but to load data, it must be present somewhere from a run. Also, if I try running, for example, https://github.com/m2lines/gz21_ocean_momentum/blob/main/examples/jupyter-notebooks/offline_test_SWM.ipynb with the data from a previous mlflow run (it does not work with a run without mlflow), the data is not found: the notebook seems to look for netCDF files, but an mlflow run as described in the README does not appear to create them (although the run is marked as FINISHED, not FAILED, in the dataset)?
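A quick way to check what a given mlflow run actually wrote to disk is to search the tracking directory for netCDF artifacts. This is a hypothetical debugging helper, not part of gz21_ocean_momentum; it assumes mlflow's default local layout of `mlruns/<experiment_id>/<run_id>/artifacts/`:

```python
from pathlib import Path

def find_netcdf_artifacts(mlruns_root: str, run_id: str) -> list[Path]:
    """Return all .nc files under a given run's artifact directory.

    Hypothetical helper; assumes mlflow's default local layout
    mlruns/<experiment_id>/<run_id>/artifacts/.
    """
    return sorted(Path(mlruns_root).glob(f"*/{run_id}/artifacts/**/*.nc"))

# Demo against a fake tracking directory:
demo = Path("demo_mlruns/0/abc123/artifacts/forcings")
demo.mkdir(parents=True, exist_ok=True)
(demo / "forcing.nc").touch()
print(find_netcdf_artifacts("demo_mlruns", "abc123"))
```

If this returns an empty list for a run that mlflow marks as FINISHED, the run never produced the netCDF files the notebook expects.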

@arthurBarthe (Collaborator)

Hi Marion,

I think a good notebook to start with would be generate-figure-1.ipynb or generate-figure-2.ipynb. Those should only require data from mlflow runs of the data processing step. Later on we will need to look at test_global_control.ipynb, but that one requires running a test step first (the notebook does not run the test itself; it only produces figures based on the output of the test step).

I would start with generate-figure-2.ipynb; it is probably the simplest one. In the notebook you select an experiment (it should be the one used in the data processing step) and a run_id from which to load the processed data; the rest is just about making the plot using this data.

@MarionBWeinzierl (Collaborator)

Thank you!

That helps me a bit further.

I just noticed that generate_figure-2 calls plot_training_subdomains at the bottom, but as you said, we do not do any training with the data here. The function fails at the line `data_ids = run_params["source.run_id"].split("/")`, potentially because there is only one data set, or because there is no training. Do you know which of the two it is?
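For what it's worth, a minimal sketch of a defensive variant of that line (hypothetical; in the notebook `run_params` comes from the mlflow run metadata) that tolerates a missing `source.run_id` key, e.g. when the run involved no training:

```python
def get_source_run_ids(run_params: dict) -> list[str]:
    """Split the 'source.run_id' parameter into individual run ids.

    Hypothetical defensive variant: returns an empty list when the key
    is absent or empty (e.g. no training run is linked) instead of raising.
    """
    raw = run_params.get("source.run_id")
    return raw.split("/") if raw else []

print(get_source_run_ids({"source.run_id": "run_a/run_b"}))  # -> ['run_a', 'run_b']
print(get_source_run_ids({}))                                # -> []
```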

@arthurBarthe (Collaborator)

Ah, I think I know where the issue is coming from. Basically we want to plot the "rectangular" subdomains used for training. The way this information is stored has changed, and can instead be accessed by reading the file training_subdomains.yaml.
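A short sketch of reading such a file, assuming PyYAML is available and a simple list-of-bounding-boxes layout; the field names here are hypothetical, so check training_subdomains.yaml in the repository for the actual schema:

```python
import yaml  # PyYAML, assumed to be installed

# Hypothetical contents standing in for training_subdomains.yaml;
# the real file may use different field names.
sample = """
- lat_min: 35.0
  lat_max: 50.0
  lon_min: -50.0
  lon_max: -20.0
- lat_min: -40.0
  lat_max: -25.0
  lon_min: -180.0
  lon_max: -162.0
"""

subdomains = yaml.safe_load(sample)
for box in subdomains:
    # each entry describes one rectangular training subdomain
    print(box["lat_min"], box["lat_max"], box["lon_min"], box["lon_max"])
```

With the parsed bounds in hand, plotting the subdomains is just one rectangle patch per entry on the map.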

As for generate-figure-1.ipynb, if you want to run it, you will need two data runs: one with the CO2 parameter equal to 0, and another with the CO2 parameter equal to 1.

@dorchard (comment marked as resolved)

@MarionBWeinzierl (Collaborator)

I managed to reproduce this figure:

[image: reproduced figure]

There are still a couple of workarounds in the code which need cleaning up, some special cases which need catching, and some docs/instructions I will need to write in/for the notebook.

Also, I have only checked whether this is the correct figure by comparing it visually with the (contorted) figure 1b in the paper.

@arthurBarthe (Collaborator)

This looks good, thanks @MarionBWeinzierl ! For now "eye checking" should be fine; we will compare numerical metrics in the inference/test step. As for the table, I don't think we need to worry about it; we can probably remove it from the code.

@MarionBWeinzierl (Collaborator)

MarionBWeinzierl commented Aug 18, 2023

Is it correct that only figure 1b is plotted, or should 1a also be plotted, @arthurBarthe ?

@MarionBWeinzierl (Collaborator)

I created #82 for this issue as a draft PR which tracks the branch.

@MarionBWeinzierl (Collaborator)

@arthurBarthe , in the notebook 'generate-paper-figure-6' (which, before the renaming, was called 'generate-figure-1' in the comments above), I can either get
[image]
by handing the script the CO2=1 version first and CO2=0 second, or
[image]
by doing it the other way around.

Comparing with figure 6a from the paper, I think that the second version is right, but I am not sure.

@MarionBWeinzierl (Collaborator)

I also get a runtime warning, `gz21_ocean_momentum/venv/lib/python3.11/site-packages/dask/array/numpy_compat.py:43: RuntimeWarning: invalid value encountered in divide x = np.divide(x1, x2, out)`, with no information on where in the code it originates, but it seems not to affect the run.
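A generic way to find where such a warning originates (a debugging sketch, not specific to this repository) is to escalate RuntimeWarning into an exception so the traceback points at the offending division; the demo below triggers the same "invalid value encountered in divide" with a 0/0:

```python
import warnings
import numpy as np

# Escalate RuntimeWarnings into exceptions so the traceback shows
# exactly where the invalid division happens.
caught = False
with warnings.catch_warnings():
    warnings.simplefilter("error", RuntimeWarning)
    try:
        # 0/0 triggers "invalid value encountered in divide"
        np.divide(np.zeros(1), np.zeros(1))
    except RuntimeWarning as exc:
        caught = True
        print("caught:", exc)
```

When the division is hidden inside a dask graph, running the computation with the synchronous scheduler (`dask.config.set(scheduler="synchronous")`) keeps everything in one process, so the escalated warning's traceback shows the real call site.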

@MarionBWeinzierl (Collaborator)

@arthurBarthe , I am now starting to look at 'test-global-control'. You said that a test step has to be run before that. What do you mean by that? Is that an inference run?

@arthurBarthe (Collaborator)

Yes, it should be CO2=0 first and CO2=1 second. As for the inference step, I am working on it: I ran the data step and the training step, so now I should be able to try and run the inference step.

@arthurBarthe (Collaborator)

Quick question, in order to produce those figures, what is the size of the data you are using along the time dimension?

@MarionBWeinzierl (Collaborator)

> Quick question, in order to produce those figures, what is the size of the data you are using along the time dimension?

I am using the call from the README, which has ntimes=100.

@arthurBarthe (Collaborator)

Ah, we should be using the whole dataset, which I believe has roughly 3000 time points. I ran the data step on the HPC here with that parameter, on 4 nodes (8GB memory each), and it finished in 30 minutes. This should explain the remaining difference from the figure in the paper. I should also update the README accordingly.

@MarionBWeinzierl (Collaborator)

We might not need to update the repository README, as ntimes=100 is a sufficiently small problem to run on a laptop and a nice starter example.

Rather, we should add a comment in the notebook, or maybe better in the README in the notebook folder, and specify there which calls to run to create the data for the paper figures.

@MarionBWeinzierl (Collaborator)

> As for the inference step, I am working on it: I ran the data step and the training step, so now I should be able to try and run the inference step.

Is that then all the data which is needed for test-global-control?

Which figure from the paper does that notebook generate?

@arthurBarthe (Collaborator)

We need data from the inference step, where we run the trained neural network (from the train step) on data from both the CO2=0 and CO2=1 datasets. Then that should be it, yes; it should produce figures 4, 5 and 7, I believe, depending on which run ids we request in the notebook.

@MarionBWeinzierl (Collaborator)

I added the information about which data to use etc. to the Jupyter notebook README on the notebooks-cleanup branch.

@MarionBWeinzierl (Collaborator)

I fixed the test_global_control script (it will need a bit of clean-up now).

These are the plots from a run with a reduced set, ntimes=100: plots.zip

Due to the reduced ntimes, I also needed to adapt some parameters in the plots.

@arthurBarthe , could you have a look and see whether they roughly make sense?

Also @arthurBarthe , can I delete the parts that say "IGNORE THIS" in test_global_control?

@MarionBWeinzierl (Collaborator)

Another question @arthurBarthe : You said I need the data for CO2=1 and CO2=0 for the test_global_control script. However, it only asks for data once. Does that mean that, to reproduce the paper figures, I have to run this notebook twice, once with each setting?

Also, in the plot I have currently set the "merge" parameter to "none", and it did not ask for the separate data and inference steps. Do I need separately named experiments, so that I can hand those in as tuples in the "merge" parameter, in order for the rest to run through as expected?

@MarionBWeinzierl (Collaborator)

@arthurBarthe , how does the data need to be organised for test_global_control?

@MarionBWeinzierl (Collaborator)

I noticed that in the pushed notebook on main, the selected experiment is called "data-global". Does that mean that for test_global_control the global parameter in the data script run should be set to 1?

@MarionBWeinzierl (Collaborator)

MarionBWeinzierl commented Sep 29, 2023

OK, we have established that the notebook has to be run twice, once with CO2=0 and once with CO2=1. Global control can stay zero. I am updating the README accordingly.
