Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

add bytes_estimate for binary push in parquet deserialize #1308

Merged
merged 8 commits into from
Dec 12, 2022

Conversation

sundy-li
Copy link
Collaborator

@sundy-li sundy-li commented Nov 30, 2022

image

values: Vec::with_capacity(capacity * 24),

Now Binary<O> will allocate too much memory even if the binary has a small size per item.

  1. Here we will reverse the capacity when pushing the 101st item by calculating the byte size per row.
  2. Add shrink_to_fit() during finish method.

@codecov
Copy link

codecov bot commented Nov 30, 2022

Codecov Report

Base: 83.12% // Head: 83.12% // Decreases project coverage by -0.00% ⚠️

Coverage data is based on head (145f406) compared to base (1417f88).
Patch coverage: 64.28% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1308      +/-   ##
==========================================
- Coverage   83.12%   83.12%   -0.01%     
==========================================
  Files         370      370              
  Lines       40158    40169      +11     
==========================================
+ Hits        33383    33390       +7     
- Misses       6775     6779       +4     
Impacted Files Coverage Δ
src/io/parquet/read/deserialize/binary/utils.rs 65.30% <37.50%> (-2.83%) ⬇️
src/io/parquet/read/deserialize/binary/basic.rs 80.43% <100.00%> (+0.24%) ⬆️
src/io/ipc/read/stream_async.rs 75.34% <0.00%> (-1.37%) ⬇️
src/bitmap/utils/slice_iterator.rs 97.56% <0.00%> (-1.22%) ⬇️
src/io/ipc/read/file_async.rs 60.82% <0.00%> (-0.38%) ⬇️
src/offset.rs 85.29% <0.00%> (-0.37%) ⬇️
src/io/ipc/read/file.rs 97.76% <0.00%> (+1.33%) ⬆️
src/chunk.rs 90.47% <0.00%> (+7.14%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@ritchie46
Copy link
Collaborator

image

Is the image before are after? Have you also got the counterpart? 😋

@sundy-li
Copy link
Collaborator Author

sundy-li commented Nov 30, 2022

Have you also got the counterpart?

It depends on the data, for a binary column with URL data, it improves profile percent by 19.42% ---> 15.69%.

Even with this pr, arrow2 read & decode large binary column is still slow.

I'll share a script and the data later, which shows that duckdb outperforms arrow2 & arrow-rs 2 times by reading same parquet files (I still can't find the reason).

@sundy-li
Copy link
Collaborator Author

sundy-li commented Nov 30, 2022

Here is the Bench script to read url column from parquet files using arrow2 & arrow1 compare with duckdb.

The result on my 16-core arch is: 

arrow: 501 ms ~ 550ms
arrow2: 816ms ~ 832 ms
duckdb: 390ms ~ 430ms

with this commit, arrow2 cost -> 628 ms

cc @ritchie46 maybe you will be interested in the result.

@ritchie46
Copy link
Collaborator

I'll share a script and the data later, which shows that duckdb outperforms arrow2 & arrow-rs 2 times by reading same parquet files (I still can't find the reason).

Yes, I also found those differences with duckdb. One easy win will be eliding the offset checks, but there is definitely more to win.

@sundy-li
Copy link
Collaborator Author

sundy-li commented Dec 4, 2022

@ritchie46
Duckdb is faster reading parquet because it does not convert parquet to an arrow array. If the parquet is in plain encoded, it will reference the plain bytes buffer and create a vector point to this buffer, this operation avoids copying the whole binary.

void StringColumnReader::PlainReference(shared_ptr<ByteBuffer> plain_data, Vector &result) {
	StringVector::AddBuffer(result, make_buffer<ParquetStringVectorBuffer>(move(plain_data)));
}

string_t StringParquetValueConversion::PlainRead(ByteBuffer &plain_data, ColumnReader &reader) {
	auto &scr = ((StringColumnReader &)reader);
	uint32_t str_len = scr.fixed_width_string_length == 0 ? plain_data.read<uint32_t>() : scr.fixed_width_string_length;

	plain_data.available(str_len);
	auto actual_str_len = ((StringColumnReader &)reader).VerifyString(plain_data.ptr, str_len);
	auto ret_str = string_t(plain_data.ptr, actual_str_len);
	plain_data.inc(str_len);
	return ret_str;
}

Copy link
Owner

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sundy-li ! Looks good to me. I do agree that we can prob do better - could you rebase against main and re-run the bench? The offset check has been fixed in main, so we should see some differences

@sundy-li
Copy link
Collaborator Author

sundy-li commented Dec 10, 2022

latest main: cost -> 770ms ~ 796 ms
this pr rebase latest main: cost -> 610ms ~ 651 ms

But it's better to introduce a vector decode path and reuse the decode buffer as arrow-rs. #1324

The current approach is streaming decode based which is pushed row by row.

@jorgecarleitao jorgecarleitao merged commit 1fcfd7c into jorgecarleitao:main Dec 12, 2022
ritchie46 pushed a commit to ritchie46/arrow2 that referenced this pull request Apr 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants