[enhancement](utf8) introduce enable_text_validate_utf8 session var #45537

Merged
morningman merged 2 commits into apache:master on Dec 26, 2024

Conversation

@hubgeter hubgeter commented Dec 17, 2024

What problem does this PR solve?

Problem Summary:
When reading text-format files through a Hive catalog or a TVF, queries can fail with the exception `Only support csv data in utf8 codec`.
This PR introduces a new session variable, `enable_text_validate_utf8`, that controls whether the UTF-8 format check is performed.
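
To illustrate the idea, here is a minimal Java sketch of a UTF-8 check gated by a boolean flag. It is not the Doris implementation: the class `Utf8ValidationSketch`, the methods `isValidUtf8` and `acceptLine`, and the `validateUtf8` parameter are hypothetical names chosen for this example. Only the error message is taken from the description above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: a UTF-8 check that can be switched off by a flag.
public class Utf8ValidationSketch {

    /** Returns true if the bytes are well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently replacing bad bytes.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Hypothetical reader hook: the line is rejected only when validation is
     * enabled; with validation disabled the raw bytes pass through unchanged.
     */
    static byte[] acceptLine(byte[] line, boolean validateUtf8) {
        if (validateUtf8 && !isValidUtf8(line)) {
            throw new IllegalArgumentException("Only support csv data in utf8 codec");
        }
        return line;
    }

    public static void main(String[] args) {
        byte[] good = "h\u00e9llo,1".getBytes(StandardCharsets.UTF_8);
        // The byte 0xFF can never occur in well-formed UTF-8.
        byte[] bad = {(byte) 'a', (byte) 0xFF, (byte) ',', (byte) '1'};

        System.out.println(isValidUtf8(good));            // true
        System.out.println(isValidUtf8(bad));             // false
        System.out.println(acceptLine(bad, false).length); // 4: check skipped
    }
}
```

In a Doris session the variable would be changed with the usual session-variable syntax, for example `SET enable_text_validate_utf8 = false;`.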

Release note

Introduced the `enable_text_validate_utf8` session variable to control whether text files are checked for valid UTF-8.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?


clang-tidy review says "All clean, LGTM! 👍"

@@ -2339,6 +2341,9 @@ public void setIgnoreShapePlanNodes(String ignoreShapePlanNodes) {
})
public boolean enableAutoCreateWhenOverwrite = false;

@VariableMgr.VarAttr(name = ENABLE_TEXT_VALIDATE_UTF8, needForward = true)

Please add a `description` field to this `VarAttr` annotation.

@hubgeter

run buildall

@hubgeter hubgeter marked this pull request as ready for review December 23, 2024 09:59
@doris-robot

TPC-H: Total hot run time: 40316 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3332bee46a808fbadf612f2951e42f9a3caec88d, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17585	7678	7407	7407
q2	2057	181	185	181
q3	10555	1097	1267	1097
q4	10573	744	759	744
q5	7611	2726	2686	2686
q6	247	157	157	157
q7	1016	637	628	628
q8	9269	1903	1951	1903
q9	6736	6464	6501	6464
q10	7063	2316	2311	2311
q11	478	268	279	268
q12	423	232	228	228
q13	17779	2931	3003	2931
q14	254	214	209	209
q15	562	511	506	506
q16	656	583	589	583
q17	1001	602	528	528
q18	7431	6723	6781	6723
q19	1335	982	1079	982
q20	467	184	187	184
q21	4075	3434	3289	3289
q22	382	315	307	307
Total cold run time: 107555 ms
Total hot run time: 40316 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	7384	7410	7354	7354
q2	328	226	226	226
q3	2970	3009	2967	2967
q4	2189	1997	1903	1903
q5	5685	5643	5662	5643
q6	229	144	151	144
q7	2237	1803	1817	1803
q8	3396	3637	3586	3586
q9	8882	8891	8999	8891
q10	3619	3589	3579	3579
q11	616	509	490	490
q12	838	626	645	626
q13	12500	3132	3165	3132
q14	308	278	274	274
q15	558	520	505	505
q16	702	647	643	643
q17	1878	1647	1599	1599
q18	8339	7726	7654	7654
q19	1734	1628	1680	1628
q20	2152	1888	1895	1888
q21	5777	5538	5512	5512
q22	655	567	599	567
Total cold run time: 72976 ms
Total hot run time: 60614 ms

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 38.80% (10094/26016)
Line Coverage: 29.79% (85144/285856)
Region Coverage: 28.90% (43462/150402)
Branch Coverage: 25.43% (22154/87118)
Coverage Report: http://coverage.selectdb-in.cc/coverage/3332bee46a808fbadf612f2951e42f9a3caec88d_3332bee46a808fbadf612f2951e42f9a3caec88d/report/index.html

@doris-robot

TPC-DS: Total hot run time: 198521 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 3332bee46a808fbadf612f2951e42f9a3caec88d, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query1	1317	958	904	904
query2	6234	2535	2487	2487
query3	11126	4890	4852	4852
query4	33344	23619	23349	23349
query5	4565	466	473	466
query6	310	197	181	181
query7	4015	312	308	308
query8	291	230	218	218
query9	9604	2765	2760	2760
query10	481	244	243	243
query11	17936	15100	15201	15100
query12	160	102	104	102
query13	1562	416	408	408
query14	10185	8067	7960	7960
query15	289	191	196	191
query16	7406	472	542	472
query17	1726	604	631	604
query18	1545	319	342	319
query19	385	178	173	173
query20	131	124	114	114
query21	216	118	115	115
query22	4790	4793	4499	4499
query23	34679	34015	33781	33781
query24	10523	2656	2615	2615
query25	617	398	402	398
query26	1309	159	159	159
query27	2368	335	337	335
query28	7700	2547	2501	2501
query29	724	422	431	422
query30	227	153	155	153
query31	1065	844	826	826
query32	99	59	57	57
query33	791	303	319	303
query34	1021	532	535	532
query35	887	762	803	762
query36	1117	945	931	931
query37	128	77	74	74
query38	4332	4159	4119	4119
query39	1515	1446	1457	1446
query40	213	111	102	102
query41	45	43	46	43
query42	119	105	101	101
query43	556	510	529	510
query44	1327	844	840	840
query45	197	165	168	165
query46	1197	767	742	742
query47	2071	1897	1924	1897
query48	446	334	323	323
query49	908	409	409	409
query50	846	410	405	405
query51	7270	7254	7131	7131
query52	106	94	93	93
query53	285	189	195	189
query54	1199	435	432	432
query55	83	80	80	80
query56	277	240	250	240
query57	1278	1171	1175	1171
query58	243	242	235	235
query59	3444	3308	3359	3308
query60	290	267	266	266
query61	110	111	116	111
query62	887	696	720	696
query63	228	195	198	195
query64	3925	682	652	652
query65	3295	3228	3198	3198
query66	754	302	308	302
query67	15930	15450	15470	15450
query68	5435	559	569	559
query69	481	258	259	258
query70	1253	1130	1179	1130
query71	479	254	262	254
query72	7000	4141	4008	4008
query73	821	369	376	369
query74	10046	8808	9037	8808
query75	3391	2673	2655	2655
query76	3702	1262	1098	1098
query77	566	280	282	280
query78	10386	10297	9671	9671
query79	1119	652	605	605
query80	853	536	440	440
query81	562	237	230	230
query82	397	115	115	115
query83	283	148	163	148
query84	233	66	71	66
query85	1163	317	313	313
query86	376	280	287	280
query87	4479	4313	4514	4313
query88	3460	2241	2258	2241
query89	432	293	293	293
query90	2089	185	191	185
query91	145	108	103	103
query92	64	55	50	50
query93	1616	572	556	556
query94	926	300	285	285
query95	347	253	255	253
query96	642	275	282	275
query97	2865	2680	2668	2668
query98	214	195	190	190
query99	1616	1337	1297	1297
Total cold run time: 301690 ms
Total hot run time: 198521 ms

@doris-robot

ClickBench: Total hot run time: 33.14 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 3332bee46a808fbadf612f2951e42f9a3caec88d, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.03	0.03	0.03
query2	0.09	0.05	0.05
query3	0.23	0.05	0.05
query4	1.64	0.08	0.08
query5	0.43	0.41	0.41
query6	1.17	0.67	0.66
query7	0.02	0.01	0.02
query8	0.06	0.05	0.04
query9	0.55	0.50	0.50
query10	0.54	0.59	0.55
query11	0.16	0.12	0.12
query12	0.15	0.13	0.13
query13	0.60	0.61	0.59
query14	2.75	2.72	2.73
query15	0.91	0.82	0.83
query16	0.38	0.38	0.38
query17	1.07	1.04	1.06
query18	0.19	0.18	0.20
query19	1.87	1.88	2.04
query20	0.02	0.01	0.02
query21	15.36	0.68	0.66
query22	4.10	7.14	1.73
query23	18.21	1.40	1.28
query24	2.24	0.22	0.22
query25	0.15	0.08	0.09
query26	0.28	0.18	0.18
query27	0.08	0.08	0.08
query28	13.26	1.17	1.16
query29	12.65	3.32	3.30
query30	0.25	0.06	0.06
query31	2.84	0.41	0.40
query32	3.23	0.49	0.48
query33	3.07	3.16	3.12
query34	17.26	4.53	4.55
query35	4.60	4.52	4.59
query36	0.67	0.48	0.49
query37	0.21	0.15	0.16
query38	0.16	0.15	0.16
query39	0.05	0.04	0.05
query40	0.17	0.13	0.13
query41	0.10	0.06	0.05
query42	0.07	0.05	0.06
query43	0.05	0.04	0.04
Total cold run time: 111.92 s
Total hot run time: 33.14 s


PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Dec 25, 2024

PR approved by anyone and no changes requested.


@kaka11chen kaka11chen left a comment

LGTM

@morningman morningman merged commit 732725d into apache:master Dec 26, 2024
26 of 28 checks passed
hubgeter added a commit to hubgeter/doris that referenced this pull request Dec 27, 2024
…#45537)

Problem Summary:
When reading text format files in Hive catalog and TVF, sometimes you
may encounter the exception `Only support csv data in utf8 codec`.
I introduced a new session variable `enable_text_validate_utf8` to
control whether to check the utf8 format.

Introduced `enable_text_validate_utf8` session variable to control
whether to check the utf8 format.
Labels
approved Indicates a PR has been approved by one committer. dev/2.1.8-merged dev/3.0.4-merged reviewed