Byte range selection performance #40

adair-kovac · 2022-02-01T19:06:09Z

Hi @blaylockbk, I meant to at least write some benchmarks to verify and quantify this, but it's been 2 months and I haven't done that, so I'll just report an impression I got about the performance:

Byte range selection for the HRRR should be significantly faster than downloading the whole GRIB2 file, and I believe it is if you use the boto3 library. But from a place with decent network speed (so not my home wifi, but the CHPC or an AWS EC2 node), it's actually faster to just download the whole GRIB2 file than to select a single field using Herbie. I'm guessing that's due to curl overhead, though my second guess is that it could be due to whatever process is used to index into the GRIB file.

I see from the code comments that you've thought about different ways of implementing the byte range selection, and I think it would be a good enhancement to Herbie if byte range selection were reliably faster than downloading the whole file.
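For reference, a minimal sketch of a single byte range request with boto3, as a point of comparison with the curl approach. This is not Herbie's implementation. The function names `byte_range` and `fetch_range` are hypothetical, and the anonymous (unsigned) client assumes the bucket allows public reads.

```python
def byte_range(start: int, end: int) -> str:
    """Format an inclusive byte range as an HTTP Range header value."""
    if start < 0 or end < start:
        raise ValueError(f"invalid byte range: {start}-{end}")
    return f"bytes={start}-{end}"


def fetch_range(bucket: str, key: str, start: int, end: int) -> bytes:
    """Download bytes start..end (inclusive) of one S3 object.

    Hypothetical helper: bucket and key are supplied by the caller.
    """
    # Imported lazily so byte_range() stays usable without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Unsigned requests: assumes the bucket is publicly readable.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    response = s3.get_object(Bucket=bucket, Key=key, Range=byte_range(start, end))
    return response["Body"].read()
```

Creating the client once and reusing it for every field would avoid repeating the setup cost on each request, which is one place where a per-request subprocess could plausibly lose time.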

blaylockbk · 2022-02-02T14:39:43Z

Maintainer

Hi @adair-kovac, thanks for sharing your thoughts on this. Yes, this is an aspect of Herbie that could be improved. The byte range request is implemented with curl because it was easy, and I haven't thought about it much since then.

One thing that hasn't worked is making multiple byte range requests in a single curl command (I think this is a limitation of the servers where the data are located, not of curl). That is why curl is executed once for each GRIB message that is subset. It would be interesting to see whether performance is better with boto3 or requests.

4 replies

jim-steenburgh Mar 19, 2022

@blaylockbk I have been dorking around with trying to download grib2 data from AWS and it does appear that multiple byte range requests are not allowed. This is old, but: https://stackoverflow.com/questions/19162723/amazon-s3-multiple-byte-range-request.

I could not get curl to work with AWS using multiple byte requests. The whole file is downloaded instead.

blaylockbk Mar 20, 2022
Maintainer

Hi @jim-steenburgh, thanks for sharing that link. I hadn't seen that before, but it confirms what I suspected. The current AWS docs say:

Amazon S3 doesn't support retrieving multiple ranges of data per GET request.

To get multiple fields from one file, Herbie does a curl for each requested field and appends each range GET to the same file.

curl -s --range 10-20 https://s3.com/file.grib2 > ./file.grib2
curl -s --range 50-60 https://s3.com/file.grib2 >> ./file.grib2
curl -s --range 80-90 https://s3.com/file.grib2 >> ./file.grib2
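The same pattern can be sketched in Python with only the standard library: one range GET per field, each appended to the same output file, but over a single reused HTTPS connection instead of a new curl process per field. This is an illustrative sketch, not Herbie's code. `fetch_ranges` and `download_subset` are hypothetical names, and whether connection reuse actually helps here is an untested assumption.

```python
import http.client
from urllib.parse import urlsplit


def fetch_ranges(conn, path, ranges):
    """Yield the body of one range GET per inclusive (start, end) pair, in order."""
    for start, end in ranges:
        conn.request("GET", path, headers={"Range": f"bytes={start}-{end}"})
        response = conn.getresponse()
        # The body must be fully read before the connection can be reused.
        body = response.read()
        if response.status != 206:  # 206 Partial Content
            raise RuntimeError(
                f"expected 206 for bytes {start}-{end}, got {response.status}"
            )
        yield body


def download_subset(url, ranges, out_path):
    """Write the requested byte ranges of url to out_path, back to back."""
    parts = urlsplit(url)  # note: any query string in url is ignored here
    conn = http.client.HTTPSConnection(parts.netloc, timeout=60)
    try:
        with open(out_path, "wb") as out:
            for chunk in fetch_ranges(conn, parts.path, ranges):
                out.write(chunk)
    finally:
        conn.close()
```

With the placeholder URL from the curl commands above, the equivalent call would be `download_subset("https://s3.com/file.grib2", [(10, 20), (50, 60), (80, 90)], "./file.grib2")`.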

jim-steenburgh Mar 20, 2022

Thanks Brian. That's a real limitation for some of the products I generate, which require hourly GFS output to 120 hrs and 3-hourly output to 168 hrs. Given the number of variables I pump into the machine learning algorithms, it will take me about 1 min to download each forecast hour if I have to grab one field at a time. It takes 30 sec to download the whole file, although even that's not fast enough, as it would take me an hour just to download 120 hrs of GFS data. Ironically, when NOMADS is working correctly, I can get this data much faster due to the multirange and subsectoring capabilities that work on that server.

There must be another way to do this. Will continue to look around.
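One possible direction, offered as an untested assumption rather than something tried in this thread: when the requested fields sit next to each other in the file, their byte ranges can be merged before downloading, so that several fields cost a single GET. A minimal sketch, with the hypothetical helper `coalesce_ranges`:

```python
def coalesce_ranges(ranges, max_gap=0):
    """Merge inclusive (start, end) byte ranges that are at most max_gap bytes apart.

    Ranges are sorted by start byte first, so the result follows file order.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] - 1 <= max_gap:
            # Adjacent, overlapping, or within max_gap: extend the previous range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Two adjacent ranges collapse into one request; the distant one stays separate.
print(coalesce_ranges([(10, 20), (21, 30), (50, 60)]))  # [(10, 30), (50, 60)]
```

With `max_gap=0` only strictly adjacent or overlapping ranges are merged, so no extra bytes are downloaded. A larger `max_gap` trades some unwanted bytes for fewer requests. Assuming every range starts and ends on a message boundary, the extra bytes would be whole messages that were not requested.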

j0nes2k Aug 4, 2022

I can confirm that AWS does not allow multiple byte ranges. For my use case (with shell scripts though, evaluating Herbie at the moment) this is a bit slower than doing a call to NOMADS with multiple byte ranges.

What about the other cloud providers, Google and Azure? Do they allow multiple byte ranges in one request?
