Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support async reading without file's content length #159

Open
kylebarron opened this issue Jul 9, 2022 · 8 comments
Open

Support async reading without file's content length #159

kylebarron opened this issue Jul 9, 2022 · 8 comments
Labels
enhancement New feature or request

Comments

@kylebarron
Copy link
Contributor

kylebarron commented Jul 9, 2022

Currently, any async reading using parquet2 requires knowing the content length of the remote resource, such as:

let (data, _) = bucket.head_object(&path).await.unwrap();
let length = data.content_length.unwrap() as usize;

However, for any API that follows the Range HTTP request header spec, knowing the content length of the file in advance is unnecessary because:

  1. negative ranges (counting from the end) are supported by the spec, i.e. Range=-4096 will fetch the last 4096 bytes of the file
  2. The last bytes of the Parquet footer contains the total number of bytes in the footer. Therefore after that initial metadata request, you know if you need another request before parsing the metadata. You should never need to know the total number of bytes in the file because every ColumnChunkMetaData contains the absolute start and end ranges of each column buffer.

In particular, the downside of needing to know the file's content length is an extra HEAD request, which in environments like a client-side browser could have significant latency.

Do you have opinions on an API where the content length is not needed (or at least is optional)? I see that the AsyncSeek trait that is required on the reader depends on the SeekFrom enum, which includes End(i64). Therefore it would seem to me that the existing AsyncSeek trait would be enough to use only negative ranges...? Or am I missing something?

@kylebarron
Copy link
Contributor Author

The one thing I haven't been able to figure out yet from research is whether a negative range that is bigger than the total bytes of the file is a problem. So if you request Range=-10000, but there are only 5000 bytes in the file, does the spec indicate that a server should respond with those 5000 bytes? Or with an error?

I'd hope it's the former, which would solve the issue you check for here:

let default_end_len = std::cmp::min(DEFAULT_FOOTER_READ_SIZE, file_size) as usize;
where you try not to read more bytes than exist in the file.

@kylebarron
Copy link
Contributor Author

I think this part of the spec declares that a suffix-length that is larger than the total size of the file will return the entire file, without erroring.

A client can request the last N bytes of the selected representation
using a suffix-byte-range-spec.

 suffix-byte-range-spec = "-" suffix-length
 suffix-length = 1*DIGIT

If the selected representation is shorter than the specified
suffix-length, the entire representation is used.

@jorgecarleitao
Copy link
Owner

Ooof, this is very interesting, thanks for sharing it!

I agree that AsyncRead + AsyncSeek is sufficient to achieve what we want here.

We can remove the calls of stream_len on reading metadata here and here so that we no longer require the file size.

Then, we can make RangedStreamer not require length and modify the math about interval intersection to still be applicable when positions are negative (or always request a new range if that is not possible).

Do you think that that could work?

@jorgecarleitao jorgecarleitao added the enhancement New feature or request label Jul 11, 2022
@kylebarron
Copy link
Contributor Author

That aligns well with what I had in mind!

@jorgecarleitao
Copy link
Owner

relevant thread: https://users.rust-lang.org/t/how-to-avoid-oom-when-reading-untrusted-sources/79263/12

the tl;dr is that if we do not have the file size, we need some mechanism to allow the user to avoid zip bombs and the like (not blocking this, just a piece of information) ^^

@kylebarron
Copy link
Contributor Author

That's so interesting. I haven't read through the whole thread yet, but the worry is that a malicious Parquet source sets the last 4 bytes to be the u32 max value? And then requesting 4GB of data from the server leads to OOM?

I wonder how pyarrow handles this. I haven't looked through their code thoroughly but I thought they handled a similar approach

@kylebarron
Copy link
Contributor Author

kylebarron commented Aug 29, 2022

Not sure if this is relevant to this issue specifically or would be better discussed in a different issue, but maybe some sort of integration with the newly-public object_store crate would be helpful to simplify async reading from remote sources:

If I'm reading it correctly, it's basically https://filesystem-spec.readthedocs.io/en/latest/ for rust

@kszlim
Copy link

kszlim commented Nov 28, 2022

Wondering if there are any updates on this, a shame that object_store doesn't implement AsyncSeek et al. (not sure if this would be possible).

But ideally, the common interface between this crate and any fsspec-like crate could be AsyncSeek + AsyncRead and Read + Seek.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants