Support async reading without file's content length #159

kylebarron · 2022-07-09T19:45:40Z

Currently, any async reading using parquet2 requires knowing the content length of the remote resource, such as:

Lines 21 to 22 in 7be3cd6

    
           let (data, _) = bucket.head_object(&path).await.unwrap(); 
        
           let length = data.content_length.unwrap() as usize;

However, for any API that follows the Range HTTP request header spec, knowing the content length of the file in advance is unnecessary because:

negative ranges (counting from the end) are supported by the spec, i.e. Range=-4096 will fetch the last 4096 bytes of the file
The last bytes of the Parquet footer contains the total number of bytes in the footer. Therefore after that initial metadata request, you know if you need another request before parsing the metadata. You should never need to know the total number of bytes in the file because every ColumnChunkMetaData contains the absolute start and end ranges of each column buffer.

In particular, the downside of needing to know the file's content length is an extra HEAD request, which in environments like a client-side browser could have significant latency.

Do you have opinions on an API where the content length is not needed (or at least is optional)? I see that the AsyncSeek trait that is required on the reader depends on the SeekFrom enum, which includes End(i64). Therefore it would seem to me that the existing AsyncSeek trait would be enough to use only negative ranges...? Or am I missing something?

The text was updated successfully, but these errors were encountered:

kylebarron · 2022-07-09T19:55:42Z

The one thing I haven't been able to figure out yet from research is whether a negative range that is bigger than the total bytes of the file is a problem. So if you request Range=-10000, but there are only 5000 bytes in the file, does the spec indicate that a server should respond with those 5000 bytes? Or with an error?

I'd hope it's the former, which would solve the issue you check for here:

parquet2/src/read/stream.rs

Line 39 in 34aac65

    
           let default_end_len = std::cmp::min(DEFAULT_FOOTER_READ_SIZE, file_size) as usize;

where you try not to read more bytes than exist in the file.

kylebarron · 2022-07-09T20:07:55Z

I think this part of the spec declares that a suffix-length that is larger than the total size of the file will return the entire file, without erroring.

A client can request the last N bytes of the selected representation
using a suffix-byte-range-spec.
 suffix-byte-range-spec = "-" suffix-length
 suffix-length = 1*DIGIT
If the selected representation is shorter than the specified
suffix-length, the entire representation is used.

jorgecarleitao · 2022-07-11T15:35:45Z

Ooof, this is very interesting, thanks for sharing it!

I agree that AsyncRead + AsyncSeek is sufficient to achieve what we want here.

We can remove the calls of stream_len on reading metadata here and here so that we no longer require the file size.

Then, we can make RangedStreamer not require length and modify the math about interval intersection to still be applicable when positions are negative (or always request a new range if that is not possible).

Do you think that that could work?

kylebarron · 2022-07-11T16:02:42Z

That aligns well with what I had in mind!

jorgecarleitao · 2022-08-04T17:10:30Z

relevant thread: https://users.rust-lang.org/t/how-to-avoid-oom-when-reading-untrusted-sources/79263/12

the tl;dr is that if we do not have the file size, we need some mechanism to allow the user to avoid zip bombs and the like (not blocking this, just a piece of information) ^^

kylebarron · 2022-08-04T17:52:07Z

That's so interesting. I haven't read through the whole thread yet, but the worry is that a malicious Parquet source sets the last 4 bytes to be the u32 max value? And then requesting 4GB of data from the server leads to OOM?

I wonder how pyarrow handles this. I haven't looked through their code thoroughly but I thought they handled a similar approach

kylebarron · 2022-08-29T22:27:44Z

Not sure if this is relevant to this issue specifically or would be better discussed in a different issue, but maybe some sort of integration with the newly-public object_store crate would be helpful to simplify async reading from remote sources:

If I'm reading it correctly, it's basically https://filesystem-spec.readthedocs.io/en/latest/ for rust

kszlim · 2022-11-28T08:19:20Z

Wondering if there are any updates on this, a shame that object_store doesn't implement AsyncSeek et al. (not sure if this would be possible).

But ideally, the common interface between this crate and any fsspec-like crate could be AsyncSeek + AsyncRead and Read + Seek.

jorgecarleitao added the enhancement New feature or request label Jul 11, 2022

kylebarron mentioned this issue Jan 28, 2023

feat(rust,python): Enable object store in scan_parquet python pola-rs/polars#6426

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support async reading without file's content length #159

Support async reading without file's content length #159

kylebarron commented Jul 9, 2022 •

edited

Loading

kylebarron commented Jul 9, 2022

kylebarron commented Jul 9, 2022

jorgecarleitao commented Jul 11, 2022

kylebarron commented Jul 11, 2022

jorgecarleitao commented Aug 4, 2022

kylebarron commented Aug 4, 2022

kylebarron commented Aug 29, 2022 •

edited

Loading

kszlim commented Nov 28, 2022

Support async reading without file's content length #159

Support async reading without file's content length #159

Comments

kylebarron commented Jul 9, 2022 • edited Loading

kylebarron commented Jul 9, 2022

kylebarron commented Jul 9, 2022

jorgecarleitao commented Jul 11, 2022

kylebarron commented Jul 11, 2022

jorgecarleitao commented Aug 4, 2022

kylebarron commented Aug 4, 2022

kylebarron commented Aug 29, 2022 • edited Loading

kszlim commented Nov 28, 2022

kylebarron commented Jul 9, 2022 •

edited

Loading

kylebarron commented Aug 29, 2022 •

edited

Loading