add bytes_estimate for binary push in parquet deserialize #1308
Conversation
Codecov Report
Base: 83.12% // Head: 83.12% // Decreases project coverage by -0.01%
Additional details and impacted files
@@ Coverage Diff @@
## main #1308 +/- ##
==========================================
- Coverage 83.12% 83.12% -0.01%
==========================================
Files 370 370
Lines 40158 40169 +11
==========================================
+ Hits 33383 33390 +7
- Misses 6775 6779 +4
It depends on the data; for a binary column with URL data, it improves the profile percentage from 19.42% to 15.69%. Even with this PR, arrow2's read & decode of a large binary column is still slow. I'll share a script and the data later, which show that duckdb outperforms arrow2 & arrow-rs by 2x when reading the same parquet files (I still can't find the reason).
Here is the bench script to read
cc @ritchie46, maybe you will be interested in the result.
Yes, I also found those differences with duckdb. One easy win would be eliding the offset checks, but there is definitely more to win.
@ritchie46
Thanks @sundy-li! Looks good to me. I do agree that we can probably do better - could you rebase against main and re-run the bench? The offset check has been fixed in main, so we should see some differences.
Latest main: cost -> 770 ms ~ 796 ms. But it would be better to introduce a vectorized decode path and reuse the decode buffer, as arrow-rs does; see #1324. The current approach is streaming decode, which pushes values row by row.
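A minimal sketch of the idea behind a bytes estimate, assuming a simple growable binary builder (the `BinaryBuilder` type, `with_estimate`, and the 32-byte guess below are hypothetical, not arrow2's actual API): reserve the values buffer up front from an estimated bytes-per-item so that the row-by-row pushes don't trigger repeated reallocations.

```rust
// Hypothetical builder for illustration only; not arrow2's actual API.
struct BinaryBuilder {
    offsets: Vec<i64>, // one offset per value, plus the initial 0
    values: Vec<u8>,   // concatenated bytes of all values
}

impl BinaryBuilder {
    /// `additional` is the number of values we expect to push;
    /// `bytes_estimate` is a guessed average size per value.
    fn with_estimate(additional: usize, bytes_estimate: usize) -> Self {
        let mut offsets = Vec::with_capacity(additional + 1);
        offsets.push(0);
        Self {
            offsets,
            // Reserving here avoids doubling reallocations while rows stream in.
            values: Vec::with_capacity(additional * bytes_estimate),
        }
    }

    fn push(&mut self, value: &[u8]) {
        self.values.extend_from_slice(value);
        self.offsets.push(self.values.len() as i64);
    }
}

fn main() {
    // Assume ~32 bytes per URL-like value; a rough, data-dependent estimate.
    let mut builder = BinaryBuilder::with_estimate(1024, 32);
    builder.push(b"https://example.com/a");
    builder.push(b"https://example.com/bb");
    assert_eq!(builder.offsets.len(), 3);
}
```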
Now `Binary<O>` will allocate too much memory even if each binary item is small; we could `shrink_to_fit()` during the finish method.
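A minimal sketch of that finish-time shrink, assuming the same builder layout as above (the `finish` helper is hypothetical, not arrow2's actual code): when the bytes estimate overshoots, the values buffer keeps unused capacity, and shrinking it on finalization hands the excess memory back.

```rust
// Hypothetical finalization step; not arrow2's actual code.
fn finish(mut offsets: Vec<i64>, mut values: Vec<u8>) -> (Vec<i64>, Vec<u8>) {
    // Release over-allocated capacity caused by an optimistic estimate.
    values.shrink_to_fit();
    offsets.shrink_to_fit();
    (offsets, values)
}

fn main() {
    let mut values = Vec::with_capacity(1 << 20); // optimistic 1 MiB reservation
    values.extend_from_slice(b"tiny");            // but the item is small
    let (offsets, values) = finish(vec![0, 4], values);
    assert!(values.capacity() >= values.len());
    assert_eq!(offsets.len(), 2);
}
```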