branch-3.0: [fix](parquet) impl has_dict_page to replace old logic and fix write empty parquet row group bug #45740 #45953

Merged
merged 1 commit into from
Dec 25, 2024

github-actions[bot]
Contributor

Cherry-picked from #45740

…empty parquet row group bug (#45740)

### What problem does this PR solve?
Problem Summary:

`has_dict_page` checks whether the given column has a dictionary page.

The function determines the presence of a dictionary page by checking
the `dictionary_page_offset` field in the column metadata. The field
must be set, must be greater than 0, and must be less than
`data_page_offset`.

These checks follow the implementation in the Java version of ORC,
where the value of `dictionary_page_offset` is used to indicate the
absence of a dictionary. Additionally, Parquet may write an empty row
group, in which case the dictionary page content is empty and the
dictionary page should not be read.
 
See https://github.com/apache/arrow/pull/2667/files
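
The three conditions above can be sketched as follows. This is a minimal illustration, not the code from the PR: `ColumnMeta` is a hypothetical stand-in for the real column metadata type, and only the two offsets named in the description are modeled.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the Parquet column metadata; only the two
// offsets mentioned in the PR description are modeled here.
struct ColumnMeta {
    std::optional<int64_t> dictionary_page_offset; // empty when the field is not set
    int64_t data_page_offset = 0;
};

// Returns true only when the dictionary page offset is set, is greater
// than 0, and lies before the data page offset.
inline bool has_dict_page(const ColumnMeta& column) {
    return column.dictionary_page_offset.has_value() &&
           *column.dictionary_page_offset > 0 &&
           *column.dictionary_page_offset < column.data_page_offset;
}
```

With this sketch, an unset offset, an offset of 0, and an offset that is not strictly before `data_page_offset` are all treated as "no dictionary page", so the reader would skip the dictionary read in each of those cases.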
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring dataroaring reopened this Dec 25, 2024
@hello-stephen
Contributor

run buildall

@doris-robot

TPC-H: Total hot run time: 41168 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e1ce2bf54829b220514e5e40cfc89c67bcc3dbc5, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17578	7490	7248	7248
q2	2065	169	162	162
q3	10558	1124	1176	1124
q4	10224	780	765	765
q5	7731	2876	2854	2854
q6	241	155	153	153
q7	971	608	605	605
q8	9359	1953	1970	1953
q9	6670	6398	6468	6398
q10	7020	2306	2295	2295
q11	484	269	285	269
q12	481	212	211	211
q13	17797	2993	3019	2993
q14	240	209	222	209
q15	566	530	531	530
q16	701	613	603	603
q17	978	640	536	536
q18	7271	6838	6856	6838
q19	1375	1062	988	988
q20	477	207	199	199
q21	3988	3336	3242	3242
q22	1085	993	1011	993
Total cold run time: 107860 ms
Total hot run time: 41168 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	7236	7253	7167	7167
q2	326	231	232	231
q3	2915	2816	2765	2765
q4	1942	1739	1717	1717
q5	5405	5479	5495	5479
q6	221	138	139	138
q7	2116	1749	1696	1696
q8	3228	3378	3427	3378
q9	8550	8476	8525	8476
q10	3476	3432	3431	3431
q11	590	489	526	489
q12	773	571	613	571
q13	10059	3032	2976	2976
q14	298	257	257	257
q15	565	523	510	510
q16	703	647	660	647
q17	1798	1587	1575	1575
q18	7638	7588	7317	7317
q19	1662	1494	1564	1494
q20	2066	1806	1782	1782
q21	5392	5105	5191	5105
q22	1134	986	1012	986
Total cold run time: 68093 ms
Total hot run time: 58187 ms

@doris-robot

TPC-DS: Total hot run time: 191142 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e1ce2bf54829b220514e5e40cfc89c67bcc3dbc5, data reload: false

query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
query1	938	374	363	363
query2	6322	2094	2086	2086
query3	6462	213	223	213
query4	33969	23440	23604	23440
query5	4331	478	450	450
query6	278	187	182	182
query7	3380	323	311	311
query8	261	236	226	226
query9	6016	2709	2688	2688
query10	438	303	272	272
query11	15985	15102	15179	15102
query12	153	100	102	100
query13	2201	434	431	431
query14	10062	6422	7090	6422
query15	247	174	178	174
query16	8196	450	501	450
query17	1625	575	561	561
query18	2149	322	319	319
query19	303	162	150	150
query20	117	112	111	111
query21	62	50	49	49
query22	4828	4520	4432	4432
query23	34580	33975	34192	33975
query24	11611	2884	2870	2870
query25	666	419	402	402
query26	1856	171	171	171
query27	2909	299	301	299
query28	8219	2503	2479	2479
query29	1042	419	410	410
query30	341	158	158	158
query31	997	774	812	774
query32	94	57	55	55
query33	789	276	270	270
query34	1148	490	519	490
query35	909	741	730	730
query36	1094	917	951	917
query37	295	74	75	74
query38	4004	3801	3845	3801
query39	1462	1424	1416	1416
query40	219	83	84	83
query41	51	46	48	46
query42	110	97	94	94
query43	540	501	476	476
query44	1256	823	810	810
query45	181	167	168	167
query46	1146	718	725	718
query47	1934	1815	1872	1815
query48	470	374	371	371
query49	1083	389	389	389
query50	810	413	401	401
query51	7306	7131	7074	7074
query52	107	88	85	85
query53	261	181	179	179
query54	1029	445	451	445
query55	78	76	80	76
query56	273	262	245	245
query57	1282	1101	1086	1086
query58	228	203	205	203
query59	3183	2831	2877	2831
query60	280	251	249	249
query61	115	111	111	111
query62	846	671	682	671
query63	218	191	184	184
query64	5268	668	639	639
query65	3283	3222	3205	3205
query66	1438	323	310	310
query67	16175	15636	15746	15636
query68	4144	579	611	579
query69	442	272	261	261
query70	1181	1120	1067	1067
query71	397	258	254	254
query72	6448	4115	4155	4115
query73	763	349	346	346
query74	10403	8909	9012	8909
query75	3382	2620	2648	2620
query76	2990	1088	1031	1031
query77	388	281	272	272
query78	10583	9755	9709	9709
query79	1080	583	599	583
query80	630	426	428	426
query81	505	239	239	239
query82	211	118	122	118
query83	174	148	154	148
query84	252	80	81	80
query85	896	314	286	286
query86	335	305	288	288
query87	4443	4490	4270	4270
query88	3548	2404	2362	2362
query89	385	296	289	289
query90	1944	187	185	185
query91	202	148	171	148
query92	60	51	50	50
query93	1042	561	555	555
query94	678	308	277	277
query95	345	259	254	254
query96	621	287	275	275
query97	3385	3250	3205	3205
query98	217	207	188	188
query99	1493	1337	1331	1331
Total cold run time: 293076 ms
Total hot run time: 191142 ms

@doris-robot

ClickBench: Total hot run time: 33.51 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit e1ce2bf54829b220514e5e40cfc89c67bcc3dbc5, data reload: false

query	cold (s)	hot run 1 (s)	hot run 2 (s)
query1	0.04	0.04	0.03
query2	0.06	0.03	0.02
query3	0.23	0.06	0.07
query4	1.61	0.10	0.11
query5	0.52	0.50	0.51
query6	1.13	0.75	0.73
query7	0.03	0.02	0.02
query8	0.03	0.03	0.03
query9	0.55	0.50	0.51
query10	0.55	0.56	0.58
query11	0.15	0.10	0.11
query12	0.14	0.12	0.11
query13	0.61	0.60	0.60
query14	2.91	3.00	3.00
query15	0.90	0.84	0.82
query16	0.38	0.36	0.39
query17	1.00	1.02	1.03
query18	0.24	0.21	0.23
query19	1.92	1.89	2.00
query20	0.01	0.01	0.01
query21	15.36	0.59	0.58
query22	2.56	1.76	1.72
query23	17.22	0.85	0.85
query24	3.26	1.52	1.48
query25	0.27	0.14	0.12
query26	0.62	0.13	0.13
query27	0.05	0.03	0.04
query28	9.80	1.12	1.08
query29	12.50	3.26	3.23
query30	0.25	0.06	0.06
query31	2.85	0.39	0.39
query32	3.25	0.45	0.45
query33	3.00	3.05	3.03
query34	17.04	4.46	4.46
query35	4.62	4.45	4.45
query36	0.66	0.49	0.48
query37	0.09	0.06	0.06
query38	0.05	0.03	0.03
query39	0.04	0.02	0.03
query40	0.15	0.13	0.13
query41	0.08	0.02	0.03
query42	0.04	0.02	0.02
query43	0.03	0.02	0.02
Total cold run time: 106.8 s
Total hot run time: 33.51 s

@morningman morningman merged commit 09cacab into branch-3.0 Dec 25, 2024
19 of 21 checks passed
@github-actions github-actions bot deleted the auto-pick-45740-branch-3.0 branch December 25, 2024 14:47