Feature request: Use satellite data stored at AWS #35

Open
raybellwaves opened this issue Aug 13, 2020 · 9 comments

raybellwaves commented Aug 13, 2020

@cgentemann has an example of how to access GOES data via AWS
https://github.com/oceanhackweek/ohw20-tutorials/blob/master/10-satellite-data-access/Access_cloud_SST_data_examples.ipynb

This is also related to pytroll/satpy#1287
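
For reference, a minimal sketch of that netCDF-over-S3 access pattern (the object path is a hypothetical example, not taken from the notebook; assumes s3fs, xarray, and h5netcdf are installed):

import s3fs
import xarray as xr

# The NOAA GOES buckets are public, so anonymous access works
fs = s3fs.S3FileSystem(anon=True)
# Hypothetical day/hour; any ABI-L2-SSTF object follows the same pattern
files = fs.glob("s3://noaa-goes16/ABI-L2-SSTF/2020/210/00/*.nc")
ds = xr.open_dataset(fs.open(files[0]), engine="h5netcdf")
print(ds["SST"])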


cgentemann commented Aug 13, 2020

Thanks! Two things:
1 - The GOES netCDF file formats are a bit of a mess; we will be updating the code later today to clean them up a bit as we read them in. @pahdsn is working on this.
2 - The netCDF files are slow -- accessing a day takes about 3 min. I'm hoping they will soon be in Zarr. The difference between accessing them in netCDF versus Zarr is striking: https://nbviewer.jupyter.org/github/oceanhackweek/ohw20-tutorials/blob/master/10-satellite-data-access/goes-cmp-netcdf-zarr.ipynb

If you go to https://github.com/oceanhackweek/ohw20-tutorials you can run the example yourself with the Binder link at the bottom.
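
For context, the Zarr side of that comparison boils down to something like this sketch (the store path here is hypothetical; the real one is in the notebook):

import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
# With consolidated=True all the variable/attribute metadata is read from a
# single object, so opening the store needs far fewer S3 requests than netCDF
ds = xr.open_zarr(fs.get_mapper("s3://some-bucket/goes16-sst.zarr"), consolidated=True)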


djhoese commented Aug 17, 2020

@cgentemann I'm rereading that goes-cmp-netcdf-zarr example -- any idea what chunk size the netCDF files defaulted to?

@cgentemann

As far as I can tell, the original netCDF files have no internal chunking, i.e. each netCDF file is one 1x5424x5424 chunk.
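
(One way to check this yourself, as a sketch -- netCDF4 files are HDF5 containers, so h5py can report the internal layout; the filename is a placeholder:)

import h5py

with h5py.File("some_goes_file.nc", "r") as f:
    # .chunks is None for a contiguous layout, i.e. no internal chunking
    print(f["SST"].chunks)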


djhoese commented Aug 17, 2020

Ah, OK, so that matches the Zarr dataset. I was curious whether the chunk size was playing a role in the timing at all; looks like it is mostly just data access. Thanks.

@cgentemann

Yes, I'm actually hoping someone might jump in here with an explanation. We deliberately didn't change the chunking, to keep the comparison as close to apples-to-apples as possible. The decrease in initial access time makes sense, because now all the metadata is consolidated. The decrease in the analysis time I'm not sure I understand -- maybe it has something to do with Zarr's concurrent reads?

Also, I've generalized the read routine to read all the GOES AWS data (not just SST). I'll post a link in a day or two; no power here right now.


pnuu commented Aug 18, 2020

Very interesting test!

What is the chunking of the Zarr data? My guess is that (possible) native chunking in the Zarr version speeds up the processing, as less data is downloaded for the sub-region cropped from the full data.

Could you also time how long the fs.glob() calls take for the NetCDF version? I've never used S3, but I have heard that these "filesystem" operations can be rather slow. Or are there other parts of get_geo_data() that cause most of the slowness? Timing shorter segments of that function would be very interesting, to see where the real bottleneck is.
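
Something along these lines (a sketch; the path pattern is illustrative, and get_geo_data is the notebook's helper) would separate the listing cost from the read cost:

import time
import s3fs

fs = s3fs.S3FileSystem(anon=True)
t0 = time.perf_counter()
files = fs.glob("s3://noaa-goes16/ABI-L2-SSTF/2020/210/*/*.nc")
print(f"glob took {time.perf_counter() - t0:.1f} s for {len(files)} files")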


djhoese commented Aug 18, 2020

> What is the chunking of the zarr data?

The chunk size is the same as the netCDF (1x5424x5424).


raybellwaves commented Sep 6, 2020

I created an end-to-end example here:
https://gist.github.com/raybellwaves/4dd2f1472468e9f67424b6a148e9ac18

It could be improved upon and added to the repo to supplement the other Himawari examples:
https://github.com/pytroll/pytroll-examples/blob/master/satpy/HRIT%20AHI%2C%20Hurricane%20Trami.ipynb
https://github.com/pytroll/pytroll-examples/blob/master/satpy/ahi_true_color_pyspectral.ipynb
Those could also be updated if their data is available on AWS.

The gist could be updated by making a temporary directory, downloading the data, saving the figure, and then deleting the downloaded data, as sketched below.
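
A sketch of that flow (the bucket path and reader choice are assumptions, not copied from the gist):

import os
import tempfile

import s3fs
from satpy import Scene

fs = s3fs.S3FileSystem(anon=True)
# Hypothetical Himawari path pattern; the gist has the real bucket layout
remote_files = fs.glob("s3://noaa-himawari8/AHI-L1b-FLDK/2020/09/06/0000/*")

with tempfile.TemporaryDirectory() as tmpdir:
    local_files = []
    for rf in remote_files:
        lf = os.path.join(tmpdir, os.path.basename(rf))
        fs.get(rf, lf)
        local_files.append(lf)
    scn = Scene(reader="ahi_hsd", filenames=local_files)
    scn.load(["B03"])  # 0.64 micron visible band
    scn.save_dataset("B03", filename="B03.png")
# the temporary directory and the downloaded files are removed on exit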

The next thing to test would be 'streaming' the data to avoid having to download it locally.

In addition, one thing I would be interested in -- and it could slot in at the end of this example -- is how to save a true color image of the full disk under an e-mailable size limit (< 20 MB). E.g. there was chat in the Slack about using tiled=True when saving as a GeoTIFF (https://pytroll.slack.com/archives/C0LNH7LMB/p1599313293263100).
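
For the GeoTIFF part, a sketch (assuming a Scene scn with 'true_color' already loaded and resampled; the keyword arguments are standard rasterio creation options that satpy's geotiff writer passes through, and the block sizes are guesses):

scn.save_dataset(
    "true_color",
    filename="true_color.tif",
    writer="geotiff",
    tiled=True,          # write internal 512x512 tiles instead of strips
    blockxsize=512,
    blockysize=512,
    compress="DEFLATE",  # the compression does most of the size reduction
)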


djhoese commented Sep 6, 2020

@raybellwaves Very nice. A couple things:

  1. Recently @gerritholl added the ability to pass a file system object to satpy's find_files_and_readers. This may simplify or provide a different style of globbing for files on an S3 store (see the sketch at the end of this comment).
  2. Recently the NetCDF C library was updated by Ryan May to allow #mode=bytes on HTTP URLs so the library can do byte range requests. This works for S3 backends too. I haven't made the pull request yet, but I posted about it in the satpy channel on Slack:
--- satpy/readers/yaml_reader.py	(revision 0de817e6d4599e971724affc9f719f9aebc41ff8)
+++ satpy/readers/yaml_reader.py	(date 1599314347246)
@@ -69,6 +69,9 @@
     """Get the end of *path* of same length as *pattern*."""
     # convert any `/` on Windows to `\\`
     path = os.path.normpath(path)
+    # remove possible #mode=bytes URL suffix to support HTTP byte range
+    # requests for NetCDF
+    path = path.split('#')[0]
     # A pattern can include directories
     tail_len = len(pattern.split(os.path.sep))
     return os.path.join(*str(path).split(os.path.sep)[-tail_len:])
In [5]: url = "https://noaa-goes16.s3.amazonaws.com/ABI-L1b-RadC/2019/001/00/OR_ABI-L1b-RadC-M3C14_G16_s20190010002187_e20190010004560_c20190010005009.nc#mode=bytes"
In [6]: scn = Scene(reader='abi_l1b', filenames=[url])
In [7]: scn.load(['C14'])
  proj_string = self.to_proj4()  # (truncated warning output)
In [8]: scn.show('C14')
Out[8]: <trollimage.xrimage.XRImage at 0x7f444e7651d0>

I'm not saying we can't incorporate your usage directly, but it might be nice, along with the rest of your suggestions, to include something like this where the files don't have to be downloaded to disk.
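
Going back to point 1, a hedged sketch of what the find_files_and_readers route could look like (whether base_dir can point into a bucket like this is an assumption; fs is the new file system argument):

from datetime import datetime

import s3fs
from satpy import Scene, find_files_and_readers

fs = s3fs.S3FileSystem(anon=True)
files = find_files_and_readers(
    base_dir="noaa-goes16/ABI-L1b-RadC/2019/001/00",
    start_time=datetime(2019, 1, 1, 0, 0),
    end_time=datetime(2019, 1, 1, 0, 10),
    reader="abi_l1b",
    fs=fs,
)
scn = Scene(filenames=files)  # files is a {reader: [paths]} mapping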
