branch-3.0: [Fix](ORC) Not push down fixed char type in orc reader #45484 #45776

Open
wants to merge 1 commit into
base: branch-3.0

github-actions[bot]

Cherry-picked from #45484

### What problem does this PR solve?
Problem Summary:
In Hive, the ORC file format supports fixed-length CHAR(n) types by
padding strings with trailing spaces up to the declared length. The
value stored in the ORC file therefore includes the padding, and the
padded value is also what the column statistics are computed from.

Doris, however, represents fixed-length CHAR(n) and variable-length
VARCHAR internally as the same type. It does not pad CHAR values and
treats them as regular strings. As a result, when Doris reads an ORC
file written by Hive and uses the file's statistics to evaluate a
pushed-down predicate, an unpadded literal is compared against padded
values, and matching rows can be filtered out incorrectly. In the
example below, the query returned no rows before this fix.
```sql
create table fixed_char_table (
  i int,
  c char(2)
) stored as orc;

insert into fixed_char_table values(1,'a'),(2,'b '), (3,'cd');
select * from fixed_char_table where c = 'a';
```
before
```text
empty
```
after
```text
1	a
```
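
The effect can be reproduced outside of Doris with a minimal sketch of min/max pruning. This is an illustration only, not the Doris or ORC implementation: `pad_char` and `may_contain_eq` are hypothetical helpers, and the sketch assumes that a non-pushed-down filter compares values with trailing spaces trimmed.

```python
def pad_char(value: str, length: int) -> str:
    """Pad with trailing spaces, as a fixed-length CHAR(n) writer would (assumption)."""
    return value.ljust(length)


def may_contain_eq(stat_min: str, stat_max: str, literal: str) -> bool:
    """Min/max pruning for `col = literal`: keep the stripe only if literal is in [min, max]."""
    return stat_min <= literal <= stat_max


# The three values from the example, as stored in a CHAR(2) column.
stored = [pad_char(v, 2) for v in ["a", "b ", "cd"]]
stat_min, stat_max = min(stored), max(stored)

# Pushed down: the unpadded literal 'a' sorts before the padded minimum 'a ',
# so the stripe is skipped and the query returns nothing.
pushed_down = may_contain_eq(stat_min, stat_max, "a")

# Not pushed down: rows are read and compared after trimming trailing spaces.
not_pushed = [v for v in stored if v.rstrip(" ") == "a"]

print(pushed_down, not_pushed)
```

Skipping pushdown for CHAR columns gives up stripe pruning for those predicates but avoids dropping rows that should match.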

If a Hive table undergoes a schema change, such as a column's type being
modified from INT to STRING, predicate pushdown should be disabled for
that column. Files written before the change still store the old type,
so pushing the predicate down compares values of mismatched types, which
can filter rows incorrectly or cause errors during query execution.
```sql
create table type_changed_table (
  id int,
  name string
) stored as orc;
insert into type_changed_table values (1, 'Alice'), (2, 'Bob'), (3, 'Charlie');
ALTER TABLE type_changed_table CHANGE COLUMN id id STRING;
select * from type_changed_table where id = '1';
```
before
```text
empty
```
after
```text
1	Alice
```
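
The two cases above amount to a guard that decides, per column, whether a predicate is safe to push down. The following is a hedged sketch of such a guard, not the code in this PR: `ColumnInfo`, `can_push_down`, and the lowercase type-name strings are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnInfo:
    name: str
    file_type: str   # type recorded in the ORC file (hypothetical naming)
    table_type: str  # type declared in the current table schema


def can_push_down(col: ColumnInfo) -> bool:
    """Return True only when file statistics are safe to use for this column."""
    # Fixed-length CHAR statistics include padding, so never push down.
    if col.file_type == "char":
        return False
    # After a schema change the file type no longer matches the table type.
    return col.file_type == col.table_type


columns = [
    ColumnInfo("c", "char", "char"),        # fixed_char_table.c
    ColumnInfo("id", "int", "string"),      # type_changed_table.id after ALTER
    ColumnInfo("name", "string", "string"), # unchanged column
]
decisions = {col.name: can_push_down(col) for col in columns}
print(decisions)
```

Columns that fail the guard are still filtered correctly, just after the data has been read rather than by statistics.
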
### Release note
[fix](orc) Not push down fixed char type in orc reader #45484
@hello-stephen

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring reopened this Dec 23, 2024
@hello-stephen

run buildall

@doris-robot

TPC-H: Total hot run time: 40390 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 2043703fa4e4cfd229b20621ca667910e200f3b8, data reload: false

------ Round 1 ----------------------------------
q1	17564	7764	7263	7263
q2	2075	168	168	168
q3	10700	1068	1193	1068
q4	10546	767	726	726
q5	7771	2784	2808	2784
q6	232	147	143	143
q7	964	610	600	600
q8	9585	1959	2019	1959
q9	7716	6370	6450	6370
q10	6968	2269	2262	2262
q11	467	261	261	261
q12	401	206	205	205
q13	17779	2983	2964	2964
q14	232	215	209	209
q15	562	510	520	510
q16	651	606	593	593
q17	975	580	588	580
q18	7194	6469	6576	6469
q19	1749	1099	1065	1065
q20	477	212	208	208
q21	3900	3145	2990	2990
q22	1090	993	997	993
Total cold run time: 109598 ms
Total hot run time: 40390 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7381	7274	7215	7215
q2	329	232	245	232
q3	2944	2883	3099	2883
q4	2124	1841	1790	1790
q5	5699	5665	5705	5665
q6	219	141	142	141
q7	2304	1783	1769	1769
q8	3321	3574	3443	3443
q9	8894	8933	8850	8850
q10	3524	3529	3503	3503
q11	604	509	501	501
q12	813	582	587	582
q13	16512	3077	3137	3077
q14	317	277	269	269
q15	560	521	510	510
q16	709	668	644	644
q17	1849	1643	1645	1643
q18	8278	7702	7587	7587
q19	3455	1488	1572	1488
q20	2081	1875	1850	1850
q21	5430	5234	5403	5234
q22	1129	984	1038	984
Total cold run time: 78476 ms
Total hot run time: 59860 ms

@doris-robot

TPC-DS: Total hot run time: 194631 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 2043703fa4e4cfd229b20621ca667910e200f3b8, data reload: false

query1	1301	921	902	902
query2	6243	2037	1983	1983
query3	10984	4300	4252	4252
query4	67292	29013	23344	23344
query5	5712	432	446	432
query6	463	175	176	175
query7	5698	306	306	306
query8	333	225	224	224
query9	9219	2668	2663	2663
query10	510	272	249	249
query11	18009	15302	15730	15302
query12	153	103	102	102
query13	1573	419	419	419
query14	10248	7127	6500	6500
query15	211	175	172	172
query16	6854	472	471	471
query17	1377	560	556	556
query18	1765	330	318	318
query19	224	166	153	153
query20	121	115	112	112
query21	69	48	43	43
query22	4410	4696	4612	4612
query23	34941	33940	34082	33940
query24	6063	2859	2916	2859
query25	535	421	414	414
query26	704	168	174	168
query27	2023	305	311	305
query28	4313	2509	2501	2501
query29	732	465	439	439
query30	246	161	160	160
query31	1042	827	827	827
query32	65	55	55	55
query33	477	298	285	285
query34	910	496	494	494
query35	843	758	749	749
query36	1071	980	936	936
query37	119	76	73	73
query38	3992	4022	3905	3905
query39	1522	1498	1505	1498
query40	152	87	110	87
query41	47	47	44	44
query42	106	95	94	94
query43	531	464	476	464
query44	1147	782	803	782
query45	185	167	170	167
query46	1146	700	730	700
query47	1997	1879	1893	1879
query48	452	367	387	367
query49	735	362	395	362
query50	813	408	427	408
query51	7252	7156	7074	7074
query52	99	85	87	85
query53	247	179	176	176
query54	549	432	431	431
query55	75	76	73	73
query56	236	236	219	219
query57	1198	1099	1072	1072
query58	211	202	208	202
query59	3053	2790	2874	2790
query60	275	251	243	243
query61	102	103	103	103
query62	778	651	666	651
query63	213	183	192	183
query64	1771	643	612	612
query65	3254	3189	3134	3134
query66	753	296	290	290
query67	15631	15331	15240	15240
query68	4552	548	534	534
query69	414	247	249	247
query70	1171	1135	1127	1127
query71	422	249	243	243
query72	6532	3868	3947	3868
query73	751	338	337	337
query74	10215	8769	8951	8769
query75	3338	2643	2618	2618
query76	2466	1082	998	998
query77	481	266	263	263
query78	10660	9955	9568	9568
query79	7709	579	564	564
query80	1836	405	409	405
query81	550	239	230	230
query82	1280	111	115	111
query83	251	146	140	140
query84	294	84	76	76
query85	1776	300	284	284
query86	473	264	287	264
query87	4536	4261	4252	4252
query88	5332	2338	2347	2338
query89	419	285	284	284
query90	2019	178	189	178
query91	180	141	144	141
query92	61	46	46	46
query93	6123	522	523	522
query94	856	289	290	289
query95	345	237	241	237
query96	611	284	283	283
query97	3334	3111	3153	3111
query98	227	197	194	194
query99	1616	1308	1280	1280
Total cold run time: 336205 ms
Total hot run time: 194631 ms

4 participants