[parquet] Fix file index result filter the row ranges missing rowgroup offset problem #4806

Merged

Tan-JiaLiang
Contributor

Purpose

Fix the bug introduced in #4780: when the file index result is used to filter row ranges, the row group offset is missing.
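
To illustrate the kind of problem described here, the following is a minimal sketch, not the code from this PR. It assumes that the file index result is a sorted list of file-global row positions and that each row group expects ranges relative to its own first row. The function name `local_ranges` and the sample data are hypothetical.

```python
from bisect import bisect_left


def local_ranges(selected_rows, row_group_start, row_group_count):
    """Map sorted file-global row positions to inclusive ranges
    that are relative to the first row of one row group."""
    row_group_end = row_group_start + row_group_count
    # Keep only the positions that fall inside this row group.
    lo = bisect_left(selected_rows, row_group_start)
    hi = bisect_left(selected_rows, row_group_end)
    ranges = []
    for pos in selected_rows[lo:hi]:
        # Subtract the row group offset; skipping this step is the
        # kind of mistake the PR title describes.
        local = pos - row_group_start
        if ranges and ranges[-1][1] + 1 == local:
            ranges[-1] = (ranges[-1][0], local)  # extend the current range
        else:
            ranges.append((local, local))  # start a new range
    return ranges


# Hypothetical data: three row groups of 1000 rows each.
selected = [3, 4, 5, 1002, 1003, 2500]
for start, count in [(0, 1000), (1000, 1000), (2000, 1000)]:
    print(start, local_ranges(selected, start, count))
```

Without the subtraction, positions 1002 and 1003 would be looked up as rows 1002 and 1003 of the second row group, which has only 1000 rows, so the matching rows would be missed.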

Tests

API and Format

Documentation

@Tan-JiaLiang
Contributor Author

Hi @JingsongLi, could you please take a look? I made a mistake in #4780, which has already been merged into v1.0, so it needs a fix.

@JingsongLi
Contributor

@Tan-JiaLiang Thanks! I will take a look!

@Tan-JiaLiang
Contributor Author

I ran the benchmark again.

Generate random values in [0, 1000) for the BSI index field, and test an EQ predicate with a random value (the index result does not filter out any ranges).

read                        Best/Avg Time(ms)  Row Rate(K/s)  Per Row(ns)  Relative
without-bsi-index-1000-761  3982 / 4210        753.4          1327.3       1.0X
without-bsi-index-1000-391  3771 / 3810        795.6          1256.9       1.1X
without-bsi-index-1000-932  3848 / 3910        779.6          1282.6       1.0X
with-bsi-index-1000-761     3856 / 3885        778.1          1285.2       1.0X
with-bsi-index-1000-391     4062 / 4116        738.5          1354.0       1.0X
with-bsi-index-1000-932     4128 / 4148        726.7          1376.1       1.0X

Generate random values in [0, 10000) for the BSI index field, and test an EQ predicate with a random value (the index result may filter out 5-10 more ranges).

read                          Best/Avg Time(ms)  Row Rate(K/s)  Per Row(ns)  Relative
without-bsi-index-10000-2830  4270 / 4348        702.6          1423.2       1.0X
without-bsi-index-10000-1992  4055 / 4126        739.7          1351.8       1.1X
without-bsi-index-10000-1052  3933 / 4107        762.8          1311.0       1.1X
with-bsi-index-10000-2830     3046 / 3094        985.0          1015.3       1.4X
with-bsi-index-10000-1992     3327 / 3372        901.7          1109.0       1.3X
with-bsi-index-10000-1052     2840 / 2975        1056.3         946.7        1.5X

Generate random values in [0, 20000) for the BSI index field, and test an EQ predicate with a random value (the index result may filter out 15-20 more ranges).

read                           Best/Avg Time(ms)  Row Rate(K/s)  Per Row(ns)  Relative
without-bsi-index-20000-16842  4181 / 4447        717.5          1393.8       1.0X
without-bsi-index-20000-10432  4324 / 4484        693.8          1441.4       1.0X
without-bsi-index-20000-14386  4270 / 4420        702.5          1423.5       1.0X
with-bsi-index-20000-16842     2077 / 2213        1444.4         692.3        2.0X
with-bsi-index-20000-10432     2265 / 2408        1324.8         754.8        1.8X
with-bsi-index-20000-14386     2372 / 2442        1264.6         790.7        1.8X

Generate random values in [0, 50000) for the BSI index field, and test an EQ predicate with a random value (the index result may filter out 30-35 more ranges).

read                           Best/Avg Time(ms)  Row Rate(K/s)  Per Row(ns)  Relative
without-bsi-index-50000-12835  4280 / 4384        700.9          1426.8       1.0X
without-bsi-index-50000-4585   4456 / 4526        673.2          1485.5       1.0X
without-bsi-index-50000-28631  4300 / 4383        697.6          1433.4       1.0X
with-bsi-index-50000-12835     1117 / 1141        2686.9         372.2        3.8X
with-bsi-index-50000-4585      1256 / 1278        2389.1         418.6        3.4X
with-bsi-index-50000-28631     1322 / 1388        2269.7         440.6        3.2X

Generate random values in [0, 100000) for the BSI index field, and test an EQ predicate with a random value (the index result may filter out 30-45 more ranges).

read                            Best/Avg Time(ms)  Row Rate(K/s)  Per Row(ns)  Relative
without-bsi-index-100000-81044  4195 / 4216        715.2          1398.2       1.0X
without-bsi-index-100000-85026  4183 / 4332        717.3          1394.2       1.0X
without-bsi-index-100000-34524  4436 / 4705        676.2          1478.8       0.9X
with-bsi-index-100000-81044     739 / 764          4061.0         246.2        5.7X
with-bsi-index-100000-85026     512 / 582          5862.7         170.6        8.2X
with-bsi-index-100000-34524     524 / 603          5728.7         174.6        8.0X

Generate random values in [0, 200000) for the BSI index field, and test an EQ predicate with a random value (the index result may filter out 25-45 more ranges).

read                             Best/Avg Time(ms)  Row Rate(K/s)  Per Row(ns)  Relative
without-bsi-index-200000-131987  4100 / 4168        731.7          1366.8       1.0X
without-bsi-index-200000-119728  4110 / 4345        729.9          1370.0       1.0X
without-bsi-index-200000-18498   4293 / 4407        698.9          1430.9       1.0X
with-bsi-index-200000-131987     452 / 492          6637.3         150.7        9.1X
with-bsi-index-200000-119728     595 / 648          5040.5         198.4        6.9X
with-bsi-index-200000-18498      390 / 417          7699.7         129.9        10.5X

Generate random values in [0, 500000) for the BSI index field, and test an EQ predicate with a random value (the index result may filter out all ranges).

read                             Best/Avg Time(ms)  Row Rate(K/s)  Per Row(ns)  Relative
without-bsi-index-500000-90083   4162 / 4225        720.8          1387.3       1.0X
without-bsi-index-500000-277761  3935 / 4165        762.4          1311.6       1.1X
without-bsi-index-500000-290711  3949 / 4166        759.8          1316.2       1.1X
with-bsi-index-500000-90083      27 / 28            113039.1       8.8          156.8X
with-bsi-index-500000-277761     220 / 237          13616.6        73.4         18.9X
with-bsi-index-500000-290711     23 / 24            131814.8       7.6          182.9X
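
For readers comparing the tables, the Relative column appears to be each row's Row Rate divided by the Row Rate of the first row in the same table. This is inferred from the figures and is not stated in the comment. A small sketch that reproduces the last table's Relative values under that assumption (the helper name `relative_to_baseline` is hypothetical):

```python
# Row Rate (K/s) values copied from the [0, 500000) table.
row_rates = {
    "without-bsi-index-500000-90083": 720.8,
    "without-bsi-index-500000-277761": 762.4,
    "without-bsi-index-500000-290711": 759.8,
    "with-bsi-index-500000-90083": 113039.1,
    "with-bsi-index-500000-277761": 13616.6,
    "with-bsi-index-500000-290711": 131814.8,
}


def relative_to_baseline(rates):
    """Divide every rate by the first entry's rate (assumed baseline)."""
    baseline = next(iter(rates.values()))
    return {name: round(rate / baseline, 1) for name, rate in rates.items()}


for name, rel in relative_to_baseline(row_rates).items():
    print(f"{name}: {rel}X")
```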

@JingsongLi
Contributor

+1

@JingsongLi JingsongLi merged commit f9165ea into apache:master Dec 31, 2024
11 of 12 checks passed
JingsongLi pushed a commit that referenced this pull request Dec 31, 2024