Feature request: Use satellite data stored at AWS #35

Open
raybellwaves opened this issue Aug 13, 2020 · 9 comments

raybellwaves commented Aug 13, 2020

@cgentemann has an example of how to access GOES data via AWS
https://github.com/oceanhackweek/ohw20-tutorials/blob/master/10-satellite-data-access/Access_cloud_SST_data_examples.ipynb

This is also related to pytroll/satpy#1287
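
For reference, a minimal sketch of that netCDF-over-S3 access pattern (the object path is a hypothetical example, not taken from the notebook; assumes s3fs, xarray, and h5netcdf are installed):

import s3fs
import xarray as xr

# The NOAA GOES buckets are public, so anonymous access works
fs = s3fs.S3FileSystem(anon=True)
# Hypothetical day/hour; any ABI-L2-SSTF object follows the same pattern
files = fs.glob("s3://noaa-goes16/ABI-L2-SSTF/2020/210/00/*.nc")
ds = xr.open_dataset(fs.open(files[0]), engine="h5netcdf")
print(ds["SST"])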


cgentemann commented Aug 13, 2020

Thanks! Two things:
1 - The GOES netCDF file formats are a bit of a mess; we will be updating the code later today to clean them up a bit as we read them in. @pahdsn is working on this.
2 - The netCDF files are slow -- accessing a day takes about 3 min. I'm hoping they will soon be in Zarr. The difference between accessing them in netCDF versus Zarr is striking: https://nbviewer.jupyter.org/github/oceanhackweek/ohw20-tutorials/blob/master/10-satellite-data-access/goes-cmp-netcdf-zarr.ipynb

If you go to https://github.com/oceanhackweek/ohw20-tutorials you can run the example yourself with the Binder link at the bottom.
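
For context, the Zarr side of that comparison boils down to something like this sketch (the store path here is hypothetical; the real one is in the notebook):

import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
# With consolidated=True all the variable/attribute metadata is read from a
# single object, so opening the store needs far fewer S3 requests than netCDF
ds = xr.open_zarr(fs.get_mapper("s3://some-bucket/goes16-sst.zarr"), consolidated=True)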


djhoese commented Aug 17, 2020

@cgentemann I'm rereading that goes-cmp-netcdf-zarr example -- any idea what chunk size the netCDF files defaulted to?

@cgentemann

As far as I can tell, the original netCDF files have no internal chunking, i.e. each netCDF file is one 1x5424x5424 chunk.
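
(One way to check this yourself, as a sketch -- netCDF4 files are HDF5 containers, so h5py can report the internal layout; the filename is a placeholder:)

import h5py

with h5py.File("some_goes_file.nc", "r") as f:
    # .chunks is None for a contiguous layout, i.e. no internal chunking
    print(f["SST"].chunks)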


djhoese commented Aug 17, 2020

Ah, OK, so that matches the Zarr dataset. I was curious whether the chunk size was playing a role in the timing at all; looks like it is mostly just data access. Thanks.

@cgentemann

Yes, I'm actually hoping someone might jump in here with an explanation. We deliberately didn't change the chunking, to keep the comparison as close to apples-to-apples as possible. The decrease in initial access time makes sense, because now all the metadata is consolidated. The decrease in the analysis time I'm not sure I understand -- maybe it has something to do with Zarr's concurrent reads?

Also, I've generalized the read routine to read all the GOES AWS data (not just SST). I'll post a link in a day or two; no power here right now.


pnuu commented Aug 18, 2020

Very interesting test!

What is the chunking of the Zarr data? My guess is that (possible) native chunking in the Zarr version speeds up the processing, as less data is downloaded for the sub-region cropped from the full data.

Could you also time how long the fs.glob() calls take for the NetCDF version? I've never used S3, but I have heard that these "filesystem" operations can be rather slow. Or are there other parts of get_geo_data() that cause most of the slowness? Timing shorter segments of that function would be very interesting, to see where the real bottleneck is.
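
Something along these lines (a sketch; the path pattern is illustrative, and get_geo_data is the notebook's helper) would separate the listing cost from the read cost:

import time
import s3fs

fs = s3fs.S3FileSystem(anon=True)
t0 = time.perf_counter()
files = fs.glob("s3://noaa-goes16/ABI-L2-SSTF/2020/210/*/*.nc")
print(f"glob took {time.perf_counter() - t0:.1f} s for {len(files)} files")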


djhoese commented Aug 18, 2020

> What is the chunking of the zarr data?

The chunk size is the same as the netCDF (1x5424x5424).


raybellwaves commented Sep 6, 2020

I created an end-to-end example here:
https://gist.github.com/raybellwaves/4dd2f1472468e9f67424b6a148e9ac18

It could be improved upon and added to the repo to supplement the other Himawari examples:
https://github.com/pytroll/pytroll-examples/blob/master/satpy/HRIT%20AHI%2C%20Hurricane%20Trami.ipynb
https://github.com/pytroll/pytroll-examples/blob/master/satpy/ahi_true_color_pyspectral.ipynb
Those could also be updated if their data is available on AWS.

The gist could be updated by making a temporary directory, downloading the data, saving the figure, and then deleting the downloaded data, as sketched below.
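
A sketch of that flow (the bucket path and reader choice are assumptions, not copied from the gist):

import os
import tempfile

import s3fs
from satpy import Scene

fs = s3fs.S3FileSystem(anon=True)
# Hypothetical Himawari path pattern; the gist has the real bucket layout
remote_files = fs.glob("s3://noaa-himawari8/AHI-L1b-FLDK/2020/09/06/0000/*")

with tempfile.TemporaryDirectory() as tmpdir:
    local_files = []
    for rf in remote_files:
        lf = os.path.join(tmpdir, os.path.basename(rf))
        fs.get(rf, lf)
        local_files.append(lf)
    scn = Scene(reader="ahi_hsd", filenames=local_files)
    scn.load(["B03"])  # 0.64 micron visible band
    scn.save_dataset("B03", filename="B03.png")
# the temporary directory and the downloaded files are removed on exit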

The next thing to test would be 'streaming' the data to avoid having to download it locally.

In addition, one thing I would be interested in -- and it could slot in at the end of this example -- is how to save a true color image of the full disk under an e-mailable size limit (< 20 MB). E.g. there was chat in the Slack about using tiled=True when saving as a GeoTIFF (https://pytroll.slack.com/archives/C0LNH7LMB/p1599313293263100).
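
For the GeoTIFF part, a sketch (assuming a Scene scn with 'true_color' already loaded and resampled; the keyword arguments are standard rasterio creation options that satpy's geotiff writer passes through, and the block sizes are guesses):

scn.save_dataset(
    "true_color",
    filename="true_color.tif",
    writer="geotiff",
    tiled=True,          # write internal 512x512 tiles instead of strips
    blockxsize=512,
    blockysize=512,
    compress="DEFLATE",  # the compression does most of the size reduction
)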


djhoese commented Sep 6, 2020

@raybellwaves Very nice. A couple things:

  1. Recently @gerritholl added the ability to pass a file system object to satpy's find_files_and_readers. This may simplify or provide a different style of globbing for files on an S3 store (see the sketch at the end of this comment).
  2. Recently the NetCDF C library was updated by Ryan May to allow #mode=bytes on HTTP URLs so the library can do byte range requests. This works for S3 backends too. I haven't made the pull request yet, but I posted about it in the satpy channel on Slack:
--- satpy/readers/yaml_reader.py	(revision 0de817e6d4599e971724affc9f719f9aebc41ff8)
+++ satpy/readers/yaml_reader.py	(date 1599314347246)
@@ -69,6 +69,9 @@
     """Get the end of *path* of same length as *pattern*."""
     # convert any `/` on Windows to `\\`
     path = os.path.normpath(path)
+    # remove possible #mode=bytes URL suffix to support HTTP byte range
+    # requests for NetCDF
+    path = path.split('#')[0]
     # A pattern can include directories
     tail_len = len(pattern.split(os.path.sep))
     return os.path.join(*str(path).split(os.path.sep)[-tail_len:])
In [5]: url = "https://noaa-goes16.s3.amazonaws.com/ABI-L1b-RadC/2019/001/00/OR_ABI-L1b-RadC-M3C14_G16_s20190010002187_e20190010004560_c20190010005009.nc#mode=bytes"
In [6]: scn = Scene(reader='abi_l1b', filenames=[url])
In [7]: scn.load(['C14'])
  proj_string = self.to_proj4()  # (truncated warning output)
In [8]: scn.show('C14')
Out[8]: <trollimage.xrimage.XRImage at 0x7f444e7651d0>

I'm not saying we can't incorporate your usage directly, but it might be nice, along with the rest of your suggestions, to include something like this where the files don't have to be downloaded to disk.
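
Going back to point 1, a hedged sketch of what the find_files_and_readers route could look like (whether base_dir can point into a bucket like this is an assumption; fs is the new file system argument):

from datetime import datetime

import s3fs
from satpy import Scene, find_files_and_readers

fs = s3fs.S3FileSystem(anon=True)
files = find_files_and_readers(
    base_dir="noaa-goes16/ABI-L1b-RadC/2019/001/00",
    start_time=datetime(2019, 1, 1, 0, 0),
    end_time=datetime(2019, 1, 1, 0, 10),
    reader="abi_l1b",
    fs=fs,
)
scn = Scene(filenames=files)  # files is a {reader: [paths]} mapping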
