Add reader for GERB high-resolution HDF5 files #2572
Conversation
Nice job putting this together. It is at a really good starting point, but I'm hoping you could make some changes to clean it up a bit.
The CI checks are complaining about a couple of things, and I tried to address some of them in my comments/suggestions. You can click "Details" next to the pre-commit CI check and that will give you a better idea of what it is complaining about. You could also install pre-commit locally to have these checks run when you make a commit.
I have two main concerns with the code otherwise:
- We'll need tests before we can merge this into Satpy. My second suggestion below might help with that since you could base your tests off of other similar readers.
- You may want to base your file handler off of the HDF5 utility handler (`class HDF5FileHandler(BaseFileHandler):` at line 35 of `satpy/readers/hdf5_utils.py`, commit 68af9cd); a sketch of that approach follows this list.
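For illustration, a minimal sketch of what basing the handler on the utility class could look like; the class name, the `get_dataset` body, and the `ds_info["path"]` key are my own assumptions, not this PR's code:

```python
from satpy.readers.hdf5_utils import HDF5FileHandler


class GERBL2HRFileHandler(HDF5FileHandler):
    """Hypothetical GERB L2 HR handler built on the HDF5 utility class."""

    def get_dataset(self, dataset_id, ds_info):
        # HDF5FileHandler loads file variables lazily and exposes them
        # through __getitem__ as dask-backed xarray DataArrays.
        data = self[ds_info["path"]]  # the "path" key is an assumption
        data.attrs.update(ds_info)
        return data
```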
satpy/readers/gerb_l2_hr_h5.py (outdated)

```python
def get_area_def(self, dsid):
    """Area definition for the GERB product"""

    if abs(self.ssp_lon) < 1e-6:
        return get_area_def("msg_seviri_fes_9km")
    elif abs(self.ssp_lon - 9.5) < 1e-6:
        return get_area_def("msg_seviri_fes_9km")
    elif abs(self.ssp_lon - 45.5) < 1e-6:
        return get_area_def("msg_seviri_iodc_9km")
    else:
        raise ValueError
```
Do the GERB files not have geolocation information internally? It'd be best if we didn't have to depend on the builtin Satpy areas.
OK, I didn't know that was preferred. I actually started out using a proj configuration line; would this be better? There are several approaches in the existing readers. A hedged sketch of the proj-based alternative is below.
In any case, the grid for GERB is fixed for the lifetime of the instrument, so the only variable parameter is the SSP longitude.
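For reference, a sketch of that proj-based approach, constructing the fixed grid with only `ssp_lon` varying; the extent and ellipsoid numbers are illustrative SEVIRI-like values, not taken from this PR:

```python
from pyresample.geometry import AreaDefinition

def gerb_area_def(ssp_lon):
    """Build the fixed 1237x1237 GERB grid for a given SSP longitude."""
    proj_dict = {"proj": "geos", "lon_0": ssp_lon, "h": 35785831.0,
                 "a": 6378169.0, "b": 6356583.8, "units": "m"}
    half_width = 5567248.0  # approximate full-disk half-width in metres
    return AreaDefinition("gerb_l2_hr", "GERB L2 HR grid", "gerb_l2_hr",
                          proj_dict, 1237, 1237,
                          (-half_width, -half_width, half_width, half_width))
```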
> the grid for GERB is fixed for the lifetime of the instrument

If I had a dollar for every time I've heard that some aspect of an instrument isn't supposed to change...
I guess when I say preferred I mean it is my preference. Normally the data files provide their own geolocation information; otherwise I would consider the files at fault. You're putting the data in the files, describe it. In some readers we've had to hardcode geolocation/area information, but only as a last resort. I guess what you've done here is the best we're going to get.
In principle, we can provide an HDF5 file, and we do when releasing a full dataset. For users wishing to download a short time series, this will be problematic, however :-/
In the NetCDF CF realm of data files, providing the projection information shouldn't be more than tens or maybe 100 bytes (e.g. `grid_mapping` variables). There would also then need to be `x` and `y` variables for the pixel coordinates, so there is some extra weight to adding that information.
What you've done here is fine.
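As a concrete illustration of the CF approach (my own sketch with typical geostationary parameters, not GERB's actual metadata):

```python
import numpy as np
import xarray as xr

# The grid_mapping variable itself is only a handful of attributes...
grid_mapping = xr.DataArray(0, attrs={
    "grid_mapping_name": "geostationary",
    "longitude_of_projection_origin": 0.0,
    "perspective_point_height": 35785831.0,
    "semi_major_axis": 6378169.0,
    "semi_minor_axis": 6356583.8,
    "sweep_angle_axis": "y",
})
flux = xr.DataArray(np.zeros((1237, 1237), dtype=np.float32), dims=("y", "x"),
                    attrs={"grid_mapping": "geostationary"})
ds = xr.Dataset({"thermal_flux": flux, "geostationary": grid_mapping})
# ...while the x/y pixel coordinates are the "extra weight" mentioned above.
ds = ds.assign_coords(x=("x", np.linspace(-5567248.0, 5567248.0, 1237)),
                      y=("y", np.linspace(5567248.0, -5567248.0, 1237)))
```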
```yaml
reader:
  name: gerb_l2_hr_h5
```
Is there a non-"HR" version of these files? Are these "official" products? Are there other L2 products that aren't HDF5? I'm wondering if we could just call this reader `gerb_l2` or `gerb_l2_h5` or something shorter?
In practice, the L2 HR (9 km) product is the one we recommend to users. We have a so-called ARG L1.5 (50 km) product and a BARG L2 (50 km) product. Those are the official products of the GERB instrument (RAL Space, Imperial College and the Royal Meteorological Institute of Belgium produce the data on behalf of EUMETSAT).
There might be an issue if/when we produce "edition 2" data in the future, which might be in NetCDF.
I think that `gerb_l2_h5` should be fine though :-)
Yeah, that makes sense. If there is a chance of NetCDF versions in the future then maybe the `h5` suffix is best for now. We could always try to throw support for all the files into a generic `gerb_l2` reader, but maybe that's a little too optimistic about the similarities of the files (especially when some of them don't exist yet).
satpy/readers/gerb_l2_hr_h5.py
Outdated
ds_min = ds[...].min() | ||
if ds_min < 0: | ||
mask = ds == ds_min | ||
ds_real[mask] = np.nan |
I'll mention other ideas for dask-ification in my main review comment, but I wanted to point out that this is very inefficient, and if/when it is done with dask arrays it will be even more inefficient. It was hard for me to understand the product guide regarding fill values, but it looks like the fill or "Error" value is always the minimum value of the in-file data type. Does that sound right?
In that case, can we hardcode this fill value in the YAML or look it up using something like `np.iinfo(ds.dtype).min`? Or are there attributes in the file that tell us this information? That way we don't have to look at every pixel just to determine the minimum value that we use for masking. Also, why does the minimum value have to be `< 0` to be used for masking?
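For illustration, a minimal sketch of that lookup, assuming the fill value really is the dtype's minimum:

```python
import numpy as np

data = np.array([np.iinfo(np.int16).min, 120, 340], dtype=np.int16)
fill = np.iinfo(data.dtype).min  # -32768 for int16; no per-pixel scan needed
masked = np.where(data == fill, np.nan, data.astype(np.float64))
```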
The "error value" is not encoded in the file. I added a fill_value
entry in the yaml and used that. Note that to do that masking at the integer step I modified the logic a tiny bit.
There is one data field in the GERB file, the cloud cover, stored as "uint8" for which the fill value is 255 and the "min" trick cannot be used. This is not relevant for the fluxes though.
satpy/readers/gerb_l2_hr_h5.py
Outdated
@property | ||
def end_time(self): | ||
"""Get end time.""" | ||
return self.start_time + timedelta(minutes=14, seconds=59) |
Normally this `end_time` is the nominal time of the data and is usually pretty "human friendly". I'm guessing the repeat cycle of the instrument is 15 minutes? In that case could we make this 15 minutes instead of 14:59? Or in some cases we just make this equal to `start_time` if we can't get it from the file data.
Would making the end time exactly 15 minutes not present problems, though, as it'll overlap with the start time of the next scan?
@simonrp84 that was my motivation (plus the fact that some other readers seem to do that)
@pdebuyl What other readers? I tried to find it when I made my comment, but I was pretty sure @mraspaud and I defined `end_time`s in Satpy as being exclusive. So start/end times falling on every 15 minutes would be fine, because any downstream application should be interpreting the end time as happening the instant before the time specified.
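Concretely, under that exclusive convention the handler could simply use the nominal repeat cycle; a minimal sketch (illustrative class, not the PR's handler):

```python
from datetime import datetime, timedelta

class NominalTimesSketch:
    """Illustrates the exclusive end_time convention discussed above."""

    start_time = datetime(2012, 6, 21, 10, 15)

    @property
    def end_time(self):
        # Exclusive: 10:30:00 itself belongs to the next repeat cycle.
        return self.start_time + timedelta(minutes=15)
```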
By chance, I had a look at the `hsaf_h5` reader (for inspiration, as it is another HDF5 reader) and that one uses 23 hours 59 minutes, hence my confusion. I'll put 15 minutes.
It looks like that was a suggestion from @pnuu here:
Perhaps we need to discuss this further. I know the FCI folks and I had trouble in our SIFT application when the nice human-friendly times weren't used, because it was awkward to represent the blocks of time in a GUI... but maybe that was more about using real observation/scanning times rather than off-by-one inclusive times versus exclusive times.
Hi @djhoese, thanks a lot for taking the time to review the reader. I'll proceed step by step, starting with the linter issues.
Codecov Report

```
@@           Coverage Diff           @@
##             main    #2572   +/-   ##
=======================================
  Coverage   94.91%   94.92%
=======================================
  Files         351      352     +1
  Lines       51215    51212     -3
=======================================
  Hits        48611    48611
+ Misses       2604     2601     -3
```

Flags with carried forward coverage won't be shown.
Pull Request Test Coverage Report for Build 6171738021
💛 - Coveralls
Hey @djhoese, I addressed several issues but not all. What remains is the following:
There is timing data in the files as well; I could parse that if it is better for Satpy. I attach an example of the data below.
I think overall I'd like to see the use of the HDF5 utility file handler base class, and tests added.
Hi David, at this point I believe the most important thing is to add tests. Do you have a good example for me to check?
@pdebuyl it depends on whether you're going to use the HDF5 utility class. You basically have two reasonable options when writing tests for readers:
Whichever way you'd like to go with it, I can try to point you to more examples.
I changed to the HDF5 base reader class; that was quite straightforward (at first I feared not being able to access Dataset attributes, but that was fine). I'll check the tests, preferably with an actual file, as the mock tests confuse me a bit (some extra abstraction there).
Hello @djhoese, I added a test file for the reader. The CI fails, but it is apparently related to hdf/netcdf package installs under Windows; should I care about it?
Regarding the test, I create a real HDF5 file containing fake data. I dumped an actual file into a hierarchy of dicts and used that to write code that re-writes a file with a similar structure. If this is useful, I can share the code that I wrote for this (it does not take into account HDF5 references and custom dtypes). A sketch of the idea is below.
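A hedged sketch of that dump-to-dicts idea (my own illustration of the approach described, not the author's actual script):

```python
import h5py

def dump_structure(group):
    """Record shapes, dtypes and attrs so a similar fake file can be rebuilt."""
    out = {}
    for key, item in group.items():
        if isinstance(item, h5py.Group):
            out[key] = dump_structure(item)
        else:  # h5py.Dataset
            out[key] = {"shape": item.shape, "dtype": str(item.dtype),
                        "attrs": dict(item.attrs)}
    return out

with h5py.File("G1_SEV2_L20_HR_SOL_TH_20120621_101500_ED01.hdf", "r") as f:
    structure = dump_structure(f)
```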
Merge with upstream
I'll have to take a look at your code, but what you've said seems reasonable.
Force-pushed from 6f157a3 to 7de9d80.
```python
@pytest.fixture(scope="session")
def gerb_l2_hr_h5_dummy_file(tmp_path_factory):
    """Create a dummy HDF5 file for the GERB L2 HR product."""
    filename = tmp_path_factory.mktemp("data") / FNAME
```
The code quality checkers are a little upset with how big this function is. How much of this is used in the actual reading code? Is this a complete copy of the file's structure? If it isn't all used, you could remove it from the test. A lot of it also looks like it could go into a function (copy C_S1, set size, create variable, write, etc.). Maybe even a for loop. Thoughts?
Indeed, this is code generated by a routine, which duplicates some instructions. A lot, actually. And very little of the file content is currently used in the code. Basically, the reader loads the most used data and makes sure that the geolocation is correct.
Co-authored-by: David Hoese <[email protected]>
1. Use function for common HDF5 operations
2. Remove unused parts of the dummy file
I rebased with main, refactored the code a bit, removed a lot of the fake file content, and added the datasets. I am OK with it, but now I can't plot resampled data. I get weird errors like "'DataArray' object has no attribute 'reshape'" or "'DataArray' object has no attribute 'values'". This seemed to me like the data is not available to the scene anymore, so I tested the diff below and it works again. Note that loading the data as a

```diff
diff --git a/satpy/readers/gerb_l2_hr_h5.py b/satpy/readers/gerb_l2_hr_h5.py
index 4dad36f0e..bf89f3aa3 100644
--- a/satpy/readers/gerb_l2_hr_h5.py
+++ b/satpy/readers/gerb_l2_hr_h5.py
@@ -29,6 +29,7 @@ from datetime import timedelta

 import dask.array as da
 import h5py
+import numpy as np
 import xarray as xr

 from satpy.readers.hdf5_utils import HDF5FileHandler
@@ -43,7 +44,7 @@ def gerb_get_dataset(hfile, name, ds_info):

     The routine takes into account the quantisation factor and fill values.
     """
-    ds = xr.DataArray(hfile[name][...])
+    ds = hfile[name][...]
     ds_attrs = hfile[name].attrs
     ds_fill = ds_info['fill_value']
     fill_mask = ds != ds_fill
@@ -51,7 +52,7 @@ def gerb_get_dataset(hfile, name, ds_info):
         ds = ds*ds_attrs['Quantisation Factor']
     else:
         ds = ds*1.
-    ds = ds.where(fill_mask)
+    ds = np.where(fill_mask, ds, np.nan)
     return ds
```
Can you give me a full traceback for one of those errors you're seeing?
Sample code:

```python
import satpy
import numpy as np
import matplotlib.pyplot as plt
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--region', default="FD")
parser.add_argument('--save', action="store_true")
args = parser.parse_args()

scene = satpy.Scene(reader="gerb_l2_hr_h5", filenames=["G1_SEV2_L20_HR_SOL_TH_20120621_101500_ED01.hdf"])
scene.load(['Thermal Flux', 'Solar Flux'])

if args.region != "FD":
    local_scene = scene.resample(args.region)
else:
    local_scene = scene

crs = local_scene['Thermal Flux'].attrs['area'].to_cartopy_crs()
ax = plt.axes(projection=crs); ax.coastlines(); ax.gridlines(); ax.set_global()
plt.imshow(local_scene['Thermal Flux'].to_numpy(), transform=crs, extent=crs.bounds, origin='upper', cmap=plt.cm.hot)
plt.colorbar(orientation='horizontal')
plt.title("GERB Thermal Flux [W/m²]")
if args.save:
    plt.savefig(f'GERB_OLR_201206211015_{args.region}.png')
    plt.clf()
else:
    plt.figure()

ax = plt.axes(projection=crs)
ax.coastlines()
ax.gridlines()
ax.set_global()
plt.imshow(local_scene['Solar Flux'].to_numpy(), transform=crs, extent=crs.bounds, origin='upper', cmap=plt.cm.hot)
plt.colorbar(orientation='horizontal')
plt.title("GERB Solar Flux [W/m²]")
if args.save:
    plt.savefig(f'GERB_RSW_201206211015_{args.region}.png')
else:
    plt.show()
```

invoked as
satpy/readers/gerb_l2_hr_h5.py (outdated)

```python
    The routine takes into account the quantisation factor and fill values.
    """
    ds = xr.DataArray(hfile[name][...])
```
This is one reason you're getting the dask errors and warnings you're seeing. The result of `hfile[name]` (I think) should be an `xarray.DataArray` already. And don't do `[...]` or you'll load it into memory, which is not what we want anymore now that the reader is trying to be dask friendly. See my other comment for more info...
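For clarity, a hedged sketch of the lazy pattern being suggested, assuming `hfile[name]` is already a dask-backed `xarray.DataArray` from `HDF5FileHandler`; this is my rewording, not the PR's final code:

```python
def gerb_get_dataset(hfile, name, ds_info):
    """Mask fill values and apply the quantisation factor lazily."""
    ds = hfile[name]  # already a lazy, dask-backed DataArray; no [...] load
    ds = ds.where(ds != ds_info["fill_value"])  # fill -> NaN, still lazy
    # Assumption: HDF5FileHandler stores HDF5 attributes in file_content
    # under "<name>/attr/<key>" paths.
    attr_key = name + "/attr/Quantisation Factor"
    if attr_key in hfile.file_content:
        ds = ds * hfile[attr_key]
    return ds
```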
Ok, got it. I am used to h5py and didn't realise that `HDF5FileHandler` would already take care of that (+ I had a look at the hsaf HDF5 reader, which does something similar).
Hi @djhoese, thanks for the hand-holding :-) This seems much cleaner now on the xarray dataset handling side. Tests + example program pass on my side; I'll be happy to resolve any further issues if necessary.
The test file creation looks a lot better. There's still a lot of duplicate code there, but maybe I'm not looking closely enough at the details. A function that takes the variable name, the array dtype, and the quantisation factor (always f64) could replace 4 or more variable creations, right? Even the ones that need `Offset` could call this additional helper function and then add `Offset` afterward. A sketch of what I mean is below.
If you're out of time and won't be able to get to this, let me know.
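Something like this hypothetical helper (the names are mine, not from the test file):

```python
import h5py
import numpy as np

def make_variable(fid, name, data, quantisation_factor=None, offset=None):
    """Create one dummy dataset with the attributes the reader reads."""
    ds = fid.create_dataset(name, data=data)
    if quantisation_factor is not None:
        ds.attrs["Quantisation Factor"] = np.float64(quantisation_factor)
    if offset is not None:
        ds.attrs["Offset"] = np.float64(offset)
    return ds
```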
I applied your recommended simplification to the reader (get rid of
Regarding the duplicate effort in the test, I don't know if it is worth it to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thanks.
Wiii, thanks @djhoese
Hi satpy,
I propose a reader for the L2 HR product of the GERB instrument. The data documentation is available here: https://gerb.oma.be/doku.php?id=documentation
I put three sample files here: https://gerb.oma.be/public/pdebuyl/sample_data/
I would be glad to see this included in satpy.
Quick background: GERB stands for "Geostationary Earth Radiation Budget". The project, for which I work, produces top-of-atmosphere fluxes in the reflected shortwave domain and in the outgoing longwave (thermal) domain.
The images are 1237 by 1237 pixels, which is the 9 km (nadir) version of the SEVIRI grid. The instrument flies on the MSG satellites.
I am happy to answer any questions, of course, and to improve my code to meet the requirements of satpy. Example of use: