Introductory Work - up to parallel processing of latitudes #5
Conversation
…ng speed n_lats_at_once
… contains start and endyear
… to different heights - needs fixing, automatisation maybe
…ded in data with variable height calculation, will be downloaded
…uding the changes in data-array handling, plotting still not adaptive
… processing via program arguments, latitude files saved individually in /lats directory
…wever full functionality of this code now also included in parallel_process_data.py
I don't understand the flow for parallel processing. I provided comments on the pieces of code that I have doubts about. However, let's discuss in a Skype call so that you can explain your rationale to me better.
config.py
Outdated
era5_data_dir = '/cephfs/user/s6lathim/ERA5Data-112/'
model_level_file_name_format = "{:d}_europe_{:d}_130_131_132_133_135.nc"  # 'ml_{:d}_{:02d}.netcdf'
surface_file_name_format = "{:d}_europe_{:d}_152.nc"  # 'sfc_{:d}_{:02d}.netcdf'
put back the old data dir and file names
taken care of - commit e3b9ba8
config.py
Outdated
output_file_name = "/cephfs/user/s6lathim/ERA5Data/results/lats/processed_data_europe_{:d}_{:d}.nc".format(start_year, final_year)
put back the old data dir and file names
taken care of (same as above) - commit e3b9ba8
plot_maps.py
Outdated
@@ -8,6 +8,7 @@

"""
from netCDF4 import Dataset
netcdf4 library is not needed anymore, right?
Yes, I forgot to remove that; it is included in xarray.
Removed in commit 12e3200
parallel_process_data.py
Outdated
$ python process_data.py

"""
from netCDF4 import Dataset
Still uses netCDF4
Yes it does, to write the output. Do you think it would be better to change all of that to xarray, too?
That would mean changing the handling of the res dictionary and, I think, also the whole definition of the output_variables dimensions and so on to use an xarray dataset.
That dataset would then still be written to a netCDF4 file.
So I think the only merit would be possibly prettier code, as the res dictionary could maybe be modified to be an xarray dataset, which would then just be written out...? I'm still getting into the use of xarray in this case, but I think that would be the workflow.
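As a rough illustration of that workflow (all variable names, shapes, and coordinate values here are invented for the sketch, not taken from the actual code), the res dictionary could be wrapped into an xarray Dataset and then written out in one step:

```python
import numpy as np
import xarray as xr

# Hypothetical stand-ins for the real processing results: one array per
# output variable, dimensioned (height, latitude, longitude).
heights = np.array([10.0, 100.0])
lats = np.array([50.0, 50.25])
lons = np.array([5.0, 5.25, 5.5])
res = {"v_mean": np.zeros((len(heights), len(lats), len(lons)))}

# Wrap the plain dict into a Dataset with named dimensions and coordinates.
ds = xr.Dataset(
    {name: (("height", "latitude", "longitude"), arr) for name, arr in res.items()},
    coords={"height": heights, "latitude": lats, "longitude": lons},
)

# Writing is then a one-liner (requires a netCDF backend such as netCDF4):
# ds.to_netcdf(output_file_name)
```

With this shape the per-variable dimension bookkeeping lives in the Dataset itself rather than in parallel index arithmetic.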
process_data.py
Outdated
@@ -19,10 +19,10 @@
from config import start_year, final_year, era5_data_dir, model_level_file_name_format, surface_file_name_format,\
    output_file_name, read_n_lats_at_once

# Set the relevant heights for the different analysis types.
# Set the relevant heights for the different analysis types in meter.#, 1500.],
remove end of sentence
removed the second comment there, see commit 76e0cf4
plot_maps.py
Outdated
# user input
if len(sys.argv) > 1:
    try:
        opts, args = getopt.getopt(sys.argv[1:], "hl", ["help", "latitudes"])
    except getopt.GetoptError:
        print("plot_maps.py -l >> process individual latitude files \n -h >> display help")
        sys.exit()
    for opt, arg in opts:
        if opt in ("-h", "--help"):
            print("plot_maps.py -l >> process individual latitude files \n -h >> display help")
            sys.exit()
        elif opt in ("-l", "--latitudes"):
            # find all latitude files matching the output_file_name specified in config.py
            output_file_name_short = output_file_name.split('/')[-1]
            output_dir = output_file_name[:-len(output_file_name_short)]
            output_file_name_prefix = output_file_name_short.split('.')[0] + '_'

            # sorting of provided latitudes included
            lats = [f.split('_')[-1][:-(len(f.split('.')[-1]) + 1)] for f in os.listdir(output_dir)
                    if f.split("/")[-1].split("lats")[0] == output_file_name_prefix]
            lats.sort(key=float)
            print(str(len(lats)) + ' latitudes found in directory ' + output_dir + ' for the years '
                  + str(start_year) + ' to ' + str(final_year) + ':')
            print(lats)

            resources = [output_dir + output_file_name_prefix + 'lats_' + lat + '.nc' for lat in lats]
            print('All latitudes are read from files with the respective latitude in a similar fashion to: ' + resources[0])
            nc = xr.open_mfdataset(resources, concat_dim='latitude')
Please add more comments, because it's hard to follow what exactly you are doing here.
Changed to subsets and greatly simplified by using a single input parameter; see commit 141ec71
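To make the file-discovery step easier to follow, here is a stand-alone sketch of the same logic (the helper name and the '<prefix>lats_<lat>.nc' pattern match are assumptions for illustration, slightly simpler than the actual string splitting in plot_maps.py):

```python
def find_latitudes(file_names, prefix):
    """Extract the latitude part from per-latitude output files named like
    '<prefix>lats_<lat>.nc' and return the latitudes sorted numerically."""
    marker = prefix + "lats_"
    lats = [f[len(marker):-len(".nc")]
            for f in file_names
            if f.startswith(marker) and f.endswith(".nc")]
    # Sort numerically: a plain string sort would put e.g. '9.5' after '10.0'.
    lats.sort(key=float)
    return lats
```

The resulting latitude strings can then be turned back into full paths and opened as one combined dataset.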
parallel_process_data.py
Outdated
try:
    opts, args = getopt.getopt(sys.argv[1:], "hs:", ["help", "subset="])
except getopt.GetoptError:
    print("parallel_process_data.py -s SubsetNumber >> process individual latitude subset SubsetNumber \n -h >> display help")
    sys.exit()
for opt, arg in opts:
    if opt in ("-h", "--help"):
        print("parallel_process_data.py -s SubsetNumber >> process individual latitude subset SubsetNumber \n -h >> display help")
        sys.exit()
    elif opt in ("-s", "--subset"):
        # Select only a specific latitude subset
        subset_number_input = int(arg)
Please add more comments explaining how the code works.
This part is concerned with reading the user input. I included a bit more info on what exactly the reading does here; I hope it is clearer now (in addition, in the first lines I included more detailed instructions as well) via commit fbb2624
Your questions also seem to show that the name of the program is quite misleading, which I changed as well (4963f4e, and in the file 084657a)
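For reference, the argument reading boils down to something like this self-contained sketch (the function name is illustrative, and -1 is assumed here as the "no subset given" default):

```python
import getopt

def parse_subset_arg(argv):
    """Parse '-s N' / '--subset=N' from an argument list and return N,
    or -1 when no subset argument is given (meaning: process everything)."""
    subset = -1
    opts, _args = getopt.getopt(argv, "hs:", ["help", "subset="])
    for opt, arg in opts:
        if opt in ("-s", "--subset"):
            subset = int(arg)
    return subset
```

In the script itself this would be called as `parse_subset_arg(sys.argv[1:])`.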
parallel_process_data.py
Outdated
    return(res, counter)


def process_complete_grid(output_file, subset_number_input):
the function is called complete grid, but you have a new argument 'subset_number_input'. This suggests that the function is not for the full grid.
For a better description, the subset interpretation was replaced in favor of latitude & latitude index naming
in commit 68d38e1
The function naming has also been replaced by more accurate job descriptions
parallel_process_data.py
Outdated
if subset_number_input == (-1):
    i_subset = 0
elif subset_number_input >= n_subsets:
    raise ValueError("User input subset number ({:.0f}) larger than total maximal subset index ({:.0f}) starting at 0.".format(subset_number_input, (n_subsets-1)))
else:
    # User input given - processing only one latitude subset:
    i_subset = subset_number_input
total_iters = len(lons)
take this outside this function
Not sure why you ask that; I guess it might also be resolved by the better function naming introduced in 68d38e1.
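For clarity, the input check above can be read as a small standalone function (the names and the -1 convention are taken from the snippet; pulling it into a function is an illustrative refactoring, not the code as committed):

```python
def resolve_subset_index(subset_number_input, n_subsets):
    """Map the user-supplied subset number to the starting subset index.
    -1 means 'no user input: start at subset 0 and process all subsets'."""
    if subset_number_input == -1:
        return 0
    if subset_number_input >= n_subsets:
        raise ValueError(
            "User input subset number ({:d}) larger than total maximal "
            "subset index ({:d}) starting at 0.".format(
                subset_number_input, n_subsets - 1))
    # Valid user input: process only this one subset.
    return subset_number_input
```

Extracting it like this would also make the reviewer's request to move the check out of process_complete_grid straightforward.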
parallel_process_data.py
Outdated


def process_lats_subset(lats_subset, lenLons, levels, lenHours, v_levels_east, v_levels_north, v_levels, t_levels, q_levels, surface_pressure, heights_of_interest, analyzed_heights_ids, res, start_time, counter, total_iters):
please rewrite such that each parallel process outputs a separate netcdf, which we concatenate in a post processing step
This is already happening if read_n_lats_at_once is 1. I had thought to keep that variable, but at the moment, if it is larger, files span multiple latitudes. Now I removed the dependency on read_n_lats_at_once and just process all latitudes individually:
54637c5 & 5dd45a1
Not sure if that's what you wanted, though, or if it's the best solution, so we could talk about it in Skype.
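A minimal in-memory sketch of the agreed workflow - each parallel worker produces one per-latitude dataset, and a post-processing step concatenates them (variable names and values here are invented for illustration):

```python
import numpy as np
import xarray as xr

# One small dataset per latitude, as each parallel worker would produce.
# In the real workflow each of these would be written with ds.to_netcdf(path).
parts = [
    xr.Dataset(
        {"v_mean": (("latitude", "longitude"), np.full((1, 3), lat))},
        coords={"latitude": [lat], "longitude": [5.0, 5.25, 5.5]},
    )
    for lat in (50.0, 50.25)
]

# Post-processing: stitch the pieces back together along latitude.
# With files on disk this corresponds to
# xr.open_mfdataset(paths, concat_dim="latitude", combine="nested").
merged = xr.concat(parts, dim="latitude")
```

Keeping the concatenation out of the workers means no shared output file and no ordering constraints between the parallel processes.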
… getopt part (reading the command line arguments)
…el but is meant to be executed for every single latitude
…ining program use
…ten for each subset
First attempt at a pull request - this should include my working solutions to #3 and #4.
#3 is resolved mainly in plot_maps (xarray requires slightly different array handling).
#4: the parallel_processing.. uses command line arguments to choose the latitude subsets; output files are saved in the /lats subfolder.
From the beginning of my introduction, a slight change to the readme is included, as one sentence gave somewhat outdated info on the downloading process, I think.
Also included are the configuration changes for the now-finishing 'Christmas Download', which I think should be good, as it is supposed to be the working configuration from now on (e.g. grid size, number of downloaded levels, ...).
I think I'll try to separate the work into more branches in the future. Any feedback is welcome!
Cheers,
Lavinia