Introductory Work - up to parallel processing of latitudes #5

Closed

Conversation

lthUniBonn

First attempt at a pull request - this should include my working solutions to #3 and #4.

#3 is addressed mainly in plot_maps (xarray requires slightly different array handling)

  • in the processing, only the output formats are changed to netCDF4
  • adaptive plotting is not included yet, so the axis scaling still has to be adapted manually. Some of my first changes to the scaling are unfortunately also included here, but plotting still works just as before, so I hope that's fine until I get to rework the plotting fully

#4: the parallel processing uses command-line arguments to choose the latitude subsets; output files are saved in the /lats subfolder

  • in principle, the previous handling of all latitudes one after the other is also still included

From the beginning of my introduction, a slight change to the readme is included, as one sentence gave outdated info on the downloading process, I think.

The configuration changes for the now-finishing 'Christmas Download' are also included (e.g. grid size, number of downloaded levels, ...), which I also think should be good, as it is supposed to be the working configuration from now on.

I think I'll try to separate the work into more branches in the future. Any feedback is welcome!
Cheers,
Lavinia

lthUniBonn added 18 commits December 6, 2020 20:52
… to different heights - needs fixing, automatisation maybe
…ded in data with variable height calculation, will be downloaded
…uding the changes in data-array handling, plotting still not adaptive
… processing via program arguments, latitude files saved individually in /lats directory
…wever full functionality of this code now also included in parallel_process_data.py
Contributor

@markschelbergen markschelbergen left a comment


I don't understand the flow for parallel processing. I provided comments on the pieces of code that I have doubts about. However, let's discuss in a Skype call so that you can better explain your rationale to me.

config.py Outdated
Comment on lines 30 to 32
era5_data_dir = '/cephfs/user/s6lathim/ERA5Data-112/'
model_level_file_name_format = "{:d}_europe_{:d}_130_131_132_133_135.nc" # 'ml_{:d}_{:02d}.netcdf'
surface_file_name_format = "{:d}_europe_{:d}_152.nc" # 'sfc_{:d}_{:02d}.netcdf'
Contributor


put back the old data dir and file names

Author


taken care of - commit e3b9ba8

config.py Outdated
output_file_name = "/cephfs/user/s6lathim/ERA5Data/results/lats/processed_data_europe_{:d}_{:d}.nc".format(start_year, final_year)
Contributor


put back the old data dir and file names

Author


taken care of (same as above) - commit e3b9ba8

plot_maps.py Outdated
@@ -8,6 +8,7 @@

"""
from netCDF4 import Dataset
Contributor


netcdf4 library is not needed anymore, right?

Author


Yes, I forgot to remove that; it is included in xarray.
Removed in commit 12e3200

$ python process_data.py

"""
from netCDF4 import Dataset
Contributor


Still uses netCDF4

Author


Yes, it does, to write the output - do you think it would be better to change all of that to xarray, too?

That would mean changing the handling of the res dictionary and, I think, also the whole definition of the output_variables dimensions and so on to using an xarray dataset; this dataset would then still be written to a netCDF4 file.

So I think the only merit would be possibly prettier code, as the res dictionary could maybe be modified to be an xarray dataset, and then we would just write that...? I'm still getting into the use of xarray in this case, but I think that would be the workflow.
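As a rough illustration of that workflow, here is a minimal sketch. The variable names, dimensions, and the single wind_speed entry are assumptions for illustration, not the actual contents of res in process_data.py:

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the res dictionary: one array per output
# variable, each spanning (time, latitude, longitude).
hours = np.arange(3)
lats = [50.0, 50.25]
lons = [5.0, 5.25, 5.5]
res = {"wind_speed": np.random.rand(len(hours), len(lats), len(lons))}

# Wrap the dict in an xarray Dataset instead of defining netCDF4
# dimensions and variables by hand ...
ds = xr.Dataset(
    {name: (("time", "latitude", "longitude"), data) for name, data in res.items()},
    coords={"time": hours, "latitude": lats, "longitude": lons},
)

# ... and the Dataset is still written out as a netCDF file in the end.
ds.to_netcdf("processed_data_sketch.nc")
```

The main change would thus be replacing the manual dimension/variable bookkeeping with the Dataset constructor; the on-disk format stays the same.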

process_data.py Outdated
@@ -19,10 +19,10 @@
from config import start_year, final_year, era5_data_dir, model_level_file_name_format, surface_file_name_format,\
output_file_name, read_n_lats_at_once

# Set the relevant heights for the different analysis types.
# Set the relevant heights for the different analysis types in meter.#, 1500.],
Contributor


remove end of sentence

Author


removed the second comment there, see commit 76e0cf4

plot_maps.py Outdated
Comment on lines 37 to 62
# user input
if len(sys.argv) > 1:
try:
opts, args = getopt.getopt(sys.argv[1:], "hl", ["help", "latitudes"])
except getopt.GetoptError:
print ("plot_maps.py -l >> process individual latitude files \n -h >> display help")
sys.exit()
for opt, arg in opts:
if opt in ("-h", "--help"):
print("plot_maps.py -l >> process individual latitude files \n -h >> display help")
sys.exit()
elif opt in ("-l", "--latitudes"):
# find all latitude files matching the output_file_name specified in config.py
output_file_name_short = output_file_name.split('/')[-1]
output_dir = output_file_name[:-len(output_file_name_short)]
output_file_name_prefix = output_file_name_short.split('.')[0] + '_'

# sorting of provided latitudes included
lats = [f.split('_')[-1][:-(len(f.split('.')[-1])+1)] for f in os.listdir(output_dir) if f.split("/")[-1].split("lats")[0] == output_file_name_prefix]
lats.sort(key=float)
print(str(len(lats)) + ' latitudes found in directory ' + output_dir + ' for the years ' + str(start_year) + ' to ' + str(final_year) + ':')
print(lats)

resources = [output_dir + output_file_name_prefix + 'lats_' + lat + '.nc' for lat in lats]
print('All latitudes are read from files with the respective latitude in a similar fashion to: ' + resources[0])
nc = xr.open_mfdataset(resources, concat_dim='latitude')
Contributor


Please add more comments, because it's hard to follow what exactly you are doing here.

Author


changed to subsets and greatly simplified by using one set input parameter, commit 141ec71
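The underlying pattern - one output file per latitude, concatenated along the latitude dimension for plotting - can be sketched roughly as follows. The file names and the wind_speed variable are placeholders, not the actual output_file_name from config.py:

```python
import numpy as np
import xarray as xr

# Write two tiny single-latitude files to mimic the per-latitude outputs.
for lat in [50.0, 50.25]:
    single = xr.Dataset(
        {"wind_speed": (("latitude", "longitude"), np.random.rand(1, 3))},
        coords={"latitude": [lat], "longitude": [5.0, 5.25, 5.5]},
    )
    single.to_netcdf("sketch_lats_{}.nc".format(lat))

# Concatenate the per-latitude files along the latitude dimension. With
# dask installed, xr.open_mfdataset(files, concat_dim='latitude', ...)
# performs this open-and-concatenate step in one call.
files = ["sketch_lats_{}.nc".format(lat) for lat in [50.0, 50.25]]
nc = xr.concat([xr.open_dataset(f) for f in files], dim="latitude")
```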

Comment on lines 582 to 593
try:
opts, args = getopt.getopt(sys.argv[1:], "hs:", ["help", "subset="])
except getopt.GetoptError:
print ("parallel_process_data.py -s SubsetNumber >> process individual latitude subset SubsetNumber \n -h >> display help ")
sys.exit()
for opt, arg in opts:
if opt in ("-h", "--help"):
print ("parallel_process_data.py -s SubsetNumber >> process individual latitude subset SubsetNumber \n -h >> display help ")
sys.exit()
elif opt in ("-s", "--subset"):
# Select only a specific latitude subset
subset_number_input = int(arg)
Contributor


Please add more comments explaining how the code works.

Author

@lthUniBonn lthUniBonn Jan 14, 2021


This part is concerned with reading the user input. I included a bit more info on what exactly the reading does here; I hope it is clearer now (in addition, I included more detailed instructions in the first lines as well) via commit fbb2624.
Your questions also seem to show that the name of the program is quite misleading, which I changed as well (4963f4e, and in the file, 084657a).
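For comparison, the same single-flag interface could also be sketched with argparse, which generates the help text automatically. This is only an illustration, not the code in the commits above:

```python
import argparse

# Hypothetical argparse equivalent of the getopt loop: one optional flag
# selecting which latitude subset this process should handle.
parser = argparse.ArgumentParser(description="Process one latitude subset.")
parser.add_argument("-s", "--subset", type=int, default=-1,
                    help="index of the latitude subset to process (-1 processes all)")

# Simulate a call such as: python parallel_process_data.py -s 3
args = parser.parse_args(["-s", "3"])
subset_number_input = args.subset
```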

return(res, counter)


def process_complete_grid(output_file, subset_number_input):
Contributor


the function is called complete grid, but you have a new argument 'subset_number_input'. This suggests that the function is not for the full grid.

Author


For a better description, the subset interpretation was replaced in favor of latitude & latitude index naming in commit 68d38e1.
The function names have also been replaced by more accurate job descriptions.

Comment on lines 327 to 334
if subset_number_input == (-1):
i_subset = 0
elif subset_number_input >= n_subsets:
raise ValueError("User input subset number ({:.0f}) larger than total maximal subset index ({:.0f}) starting at 0.".format(subset_number_input, (n_subsets-1)))
else:
# User input given - processing only one latitude subset:
i_subset = subset_number_input
total_iters = len(lons)
Contributor


take this outside this function

Author


I'm not sure why you ask that; I guess it might also be resolved by the better function naming introduced in 68d38e1.




def process_lats_subset(lats_subset, lenLons, levels, lenHours, v_levels_east, v_levels_north, v_levels, t_levels, q_levels, surface_pressure, heights_of_interest, analyzed_heights_ids, res, start_time, counter, total_iters):
Contributor


please rewrite such that each parallel process outputs a separate netcdf, which we concatenate in a post processing step

Author


This is already happening if read_n_lats_at_once is 1 - I had thought to keep that variable, but at the moment, if it's larger, files span multiple latitudes. I have now removed the dependency on read_n_lats_at_once and just process all latitudes individually:
54637c5 & 5dd45a1
Not sure if that's what you wanted, though, or if it's the best solution - so we could talk about that in Skype then.
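A rough sketch of that per-latitude output step, under the assumption that each parallel worker handles exactly one latitude; the function name, file names, and the wind_speed variable are illustrative, not the actual code:

```python
import numpy as np
import xarray as xr

def write_latitude_result(lat, lons, wind_speed, output_dir="."):
    """Write one worker's result for a single latitude to its own netCDF file."""
    ds = xr.Dataset(
        {"wind_speed": (("latitude", "longitude"), wind_speed[np.newaxis, :])},
        coords={"latitude": [lat], "longitude": lons},
    )
    path = "{}/processed_lat_{}.nc".format(output_dir, lat)
    ds.to_netcdf(path)
    return path

# One call per parallel process; a post-processing step later concatenates
# the resulting files along the latitude dimension.
path = write_latitude_result(50.0, np.array([5.0, 5.25, 5.5]), np.random.rand(3))
```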

@lthUniBonn lthUniBonn closed this Jan 20, 2021
@lthUniBonn lthUniBonn mentioned this pull request Jan 22, 2021