-
Notifications
You must be signed in to change notification settings - Fork 223
Reduced re-alloc in parquet #1337
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportBase: 83.63% // Head: 83.78% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #1337 +/- ##
==========================================
+ Coverage 83.63% 83.78% +0.14%
==========================================
Files 373 373
Lines 40284 40392 +108
==========================================
+ Hits 33692 33841 +149
+ Misses 6592 6551 -41
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
I noticed a significant performance regression on reading utf8 data. I think it is related to now defaulting to a |
| State::OptionalDictionary(_, _) | ||
| State::OptionalDelta(_, _) | ||
| State::FilteredOptionalDelta(_, _) => ( | ||
Binary::<O>::with_capacity(capacity, 0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value capacity of 0
is very costly in most cases.
| State::RequiredDictionary(_) | ||
| State::Delta(_) | ||
| State::FilteredDelta(_) => ( | ||
Binary::<O>::with_capacity(capacity, 0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value capacity of 0
is very costly in most cases.
@@ -59,7 +59,7 @@ fn read_dict<O: Offset>(data_type: DataType, dict: &DictPage) -> Box<dyn Array> | |||
|
|||
let values = SizedBinaryIter::new(&dict.buffer, dict.num_values); | |||
|
|||
let mut data = Binary::<O>::with_capacity(dict.num_values); | |||
let mut data = Binary::<O>::with_capacity(dict.num_values, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value capacity of 0
is very costly in most cases.
@@ -36,10 +36,10 @@ impl<O: Offset> Pushable<usize> for Offsets<O> { | |||
|
|||
impl<O: Offset> Binary<O> { | |||
#[inline] | |||
pub fn with_capacity(capacity: usize) -> Self { | |||
pub fn with_capacity(capacity: usize, values_capacity: usize) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should have this signature:
pub fn with_capacity(capacity: usize, values_capacity: Option<usize>) -> Self {
and then
let values_capacity = values_capacity.unwrap_or(capacity.min(100) * 24);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the values_capacity is zero by default, maybe we could use a need_estimated
to reserve the space.
We use this in databend
e8e04f7
to
ef34171
Compare
Sorry for the delay on this one - if anyone would like to take an extra pass go ahead, otherwise I will merge it in. |
Closes #1324