Support async reading without file's content length #159
Comments
The one thing I haven't been able to figure out yet from research is whether a negative range that is bigger than the total bytes of the file is a problem. So if you request I'd hope it's the former, which would solve the issue you check for here: Line 39 in 34aac65
I think this part of the spec declares that a
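For reference, a minimal sketch of how a suffix range such as `bytes=-N` resolves against a resource of known length, as I read the HTTP range semantics (RFC 9110): when the suffix length exceeds the resource length, the whole resource is selected rather than an error being raised. The function name `resolve_suffix_range` is hypothetical, and real servers or object stores may not all behave this way.

```rust
use std::ops::Range;

/// Resolve an HTTP suffix range (`bytes=-suffix_len`) against a resource
/// of `len` bytes. Hypothetical helper, for illustration only.
///
/// Returns `None` for a zero suffix length, which is not satisfiable.
/// A suffix longer than the resource selects the entire resource.
fn resolve_suffix_range(len: u64, suffix_len: u64) -> Option<Range<u64>> {
    if suffix_len == 0 {
        return None;
    }
    // saturating_sub clamps the start to 0 when suffix_len > len.
    Some(len.saturating_sub(suffix_len)..len)
}
```

Under that reading, requesting the last 4096 bytes of a 1000-byte file would simply return all 1000 bytes.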
Ooof, this is very interesting, thanks for sharing it! I agree that We can remove the calls of Then, we can make Do you think that that could work?
That aligns well with what I had in mind!
Relevant thread: https://users.rust-lang.org/t/how-to-avoid-oom-when-reading-untrusted-sources/79263/12. The tl;dr is that if we do not have the file size, we need some mechanism that lets the user avoid zip bombs and the like (not blocking this, just a piece of information) ^^
That's so interesting. I haven't read through the whole thread yet, but the worry is that a malicious Parquet source sets the last 4 bytes to be the I wonder how pyarrow handles this. I haven't looked through their code thoroughly, but I thought they took a similar approach.
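One possible shape for such a mechanism, as a rough sketch: validate the footer length declared in the 8-byte Parquet trailer (a 4-byte little-endian length followed by the `PAR1` magic) against a caller-supplied cap before allocating or fetching anything. The names `bounded_metadata_len` and `max_len` are hypothetical and not part of parquet2; the length is interpreted here as unsigned, which is an assumption of this sketch.

```rust
/// Magic bytes that terminate a Parquet file.
const MAGIC: [u8; 4] = *b"PAR1";

/// Read the metadata length declared in the 8-byte trailer and reject it
/// if it exceeds `max_len`. Hypothetical helper, not a parquet2 API.
fn bounded_metadata_len(trailer: &[u8; 8], max_len: u32) -> Result<u32, String> {
    if trailer[4..] != MAGIC[..] {
        return Err("trailer does not end with the PAR1 magic".to_string());
    }
    // The first 4 bytes hold the declared metadata length (little-endian).
    let declared = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    if declared > max_len {
        return Err(format!(
            "declared metadata length {declared} exceeds the limit {max_len}"
        ));
    }
    Ok(declared)
}
```

With a cap like this, an untrusted source can still lie about the length, but it cannot make the reader request or allocate more than the caller allowed.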
Not sure if this is relevant to this issue specifically or would be better discussed in a different issue, but maybe some sort of integration with the newly-public
If I'm reading it correctly, it's basically https://filesystem-spec.readthedocs.io/en/latest/ for Rust.
Wondering if there are any updates on this. It's a shame that object_store doesn't implement AsyncSeek et al. (not sure if this would be possible). Ideally, the common interface between this crate and any fsspec-like crate could be AsyncSeek + AsyncRead and Read + Seek.
Currently, any async reading using `parquet2` requires knowing the content length of the remote resource, such as in:

parquet2/examples/s3/src/main.rs
Lines 21 to 22 in 7be3cd6

However, for any API that follows the `Range` HTTP request header spec, knowing the content length of the file in advance is unnecessary because:

- `Range: bytes=-4096` will fetch the last 4096 bytes of the file
- `ColumnChunkMetaData` contains the absolute start and end ranges of each column buffer.

In particular, the downside of needing to know the file's content length is an extra `HEAD` request, which in environments like a client-side browser could have significant latency.

Do you have opinions on an API where the content length is not needed (or at least is optional)? I see that the `AsyncSeek` trait that is required on the `reader` depends on the `SeekFrom` enum, which includes `End(i64)`. Therefore it would seem to me that the existing `AsyncSeek` trait would be enough to use only negative ranges...? Or am I missing something?