Parquet's plain encoding of each value is [length as i32][bytes]. Could you explain how this could be decoded more efficiently?
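For reference, a minimal sketch of walking that layout (a hypothetical helper, not the arrow2 decoder): each value is a 4-byte little-endian length followed by that many bytes, which is what forces the decoder to grow its output incrementally.

```rust
/// Sketch only: walk PLAIN-encoded BYTE_ARRAY data, where each value is a
/// little-endian i32 length prefix followed by `length` bytes.
/// No bounds checks for malformed pages; hypothetical helper, not arrow2 code.
fn decode_plain(mut encoded: &[u8]) -> Vec<Vec<u8>> {
    let mut values = Vec::new();
    while encoded.len() >= 4 {
        let len = i32::from_le_bytes(encoded[..4].try_into().unwrap()) as usize;
        encoded = &encoded[4..];
        values.push(encoded[..len].to_vec());
        encoded = &encoded[len..];
    }
    values
}
```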
I think the root cause of the issue is that when the batch is very large, we need to perform large re-allocations. I do think we can do better, though: given the encoded binary values, we can estimate their total size as encoded.len() - 4 * num_values. This only works for required columns; for optional columns, we could use encoded.len() - 4 * (num_values - num_nulls).
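A sketch of how that estimate could be used to reserve the values buffer up front, assuming `encoded` is the PLAIN page data and `num_values`/`num_nulls` come from the page metadata (names are illustrative, not arrow2's API):

```rust
/// Sketch: reserve the values buffer once using the size estimate
/// `encoded.len() - 4 * (num_values - num_nulls)`, since only non-null
/// values carry a 4-byte length prefix in the page.
fn reserve_values(encoded: &[u8], num_values: usize, num_nulls: usize) -> Vec<u8> {
    let non_null = num_values - num_nulls;
    // For required columns num_nulls == 0, which recovers
    // encoded.len() - 4 * num_values.
    let estimated = encoded.len().saturating_sub(4 * non_null);
    Vec::with_capacity(estimated)
}
```

With the capacity reserved up front, subsequent extend calls append into the same allocation instead of repeatedly re-allocating as the batch grows.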
Let's look at the code in
https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/deserialize/primitive/basic.rs#L219-L226
It has an extra memcpy in values.extend and in decode; I think we could optimize it by using a Buffer clone. The first motivation is to move … to …
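A rough illustration of why a Buffer clone would help, using std Arc as a stand-in for arrow2's reference-counted Buffer (the call sites are assumptions, not the actual reader code):

```rust
use std::sync::Arc;

// Conceptual sketch only: contrast copying decoded bytes with sharing them.
fn copy_vs_share(page_bytes: Vec<u8>) {
    // Today: an extra memcpy of every decoded byte into the values vec.
    let mut values: Vec<u8> = Vec::new();
    values.extend(page_bytes.iter().copied());

    // Idea: wrap the decoded bytes once and clone the handle instead;
    // cloning only bumps a refcount, so no bytes are copied.
    let buffer = Arc::new(page_bytes);
    let _shared = Arc::clone(&buffer);
}
```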
@jorgecarleitao what do you think about this?
I found that arrow-rs has already addressed this improvement in https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader/byte_array.rs#L115-L138