[feat](metrics) Unify metrics of thread pool #43144

Open: wants to merge 2 commits into base: master

zhiqiang-hhhh
Contributor

@zhiqiang-hhhh zhiqiang-hhhh commented Nov 2, 2024

What problem does this PR solve?

Add metrics for all thread pools, more specifically, for all ThreadPool objects.
Every thread pool will have the following metrics:

  1. thread_pool_active_threads
  2. thread_pool_queue_size
  3. thread_pool_max_queue_size
  4. thread_pool_max_threads
  5. task_execution_time_ns_avg_in_last_1000_times
  6. task_wait_worker_ns_avg_in_last_1000_times

A new class IntervalHistogramStat is created for interval histogram calculation.
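
For illustration, the `..._avg_in_last_1000_times` metric names suggest an average over a bounded window of recent samples. The sketch below shows one way to do that with a fixed-size ring buffer. The class name `LastNAverage` and its methods are hypothetical; this is not the PR's `IntervalHistogramStat`, whose actual interface is not shown in this conversation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical sketch, not the PR's IntervalHistogramStat.
// Keeps the most recent `capacity` samples and reports their mean.
class LastNAverage {
public:
    explicit LastNAverage(std::size_t capacity) : _samples(capacity, 0) {
        assert(capacity > 0);
    }

    // Record one sample, evicting the oldest one once the window is full.
    void add(int64_t value) {
        std::lock_guard<std::mutex> guard(_lock);
        if (_count == _samples.size()) {
            _sum -= _samples[_next]; // _next points at the oldest sample
        } else {
            ++_count;
        }
        _samples[_next] = value;
        _sum += value;
        _next = (_next + 1) % _samples.size();
    }

    // Integer mean of the samples currently in the window (0 if empty).
    int64_t average() const {
        std::lock_guard<std::mutex> guard(_lock);
        return _count == 0 ? 0 : _sum / static_cast<int64_t>(_count);
    }

private:
    mutable std::mutex _lock;
    std::vector<int64_t> _samples;
    std::size_t _next = 0;
    std::size_t _count = 0;
    int64_t _sum = 0;
};
```

With a capacity of 1000, `average()` would correspond to an "average of the last 1000 times" value; a running sum keeps both `add` and `average` constant-time.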

Metrics are updated by a hook method when Prometheus requests them.

Check List (For Committer)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.
  • Release note

    None

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@zhiqiang-hhhh
Contributor Author

run buildall

Contributor

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

be/src/vec/exec/scan/scanner_scheduler.cpp (outdated, resolved)
@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.82% (9821/25967)
Line Coverage: 28.99% (81627/281564)
Region Coverage: 28.25% (42144/149158)
Branch Coverage: 24.83% (21381/86106)
Coverage Report: http://coverage.selectdb-in.cc/coverage/b36415796af4881788f26f82f81791a16e8e4608_b36415796af4881788f26f82f81791a16e8e4608/report/index.html

@zhiqiang-hhhh
Contributor Author

run buildall

@zhiqiang-hhhh zhiqiang-hhhh marked this pull request as draft November 4, 2024 00:59
@zhiqiang-hhhh
Contributor Author

run buildall

@zhiqiang-hhhh
Contributor Author

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.82% (9820/25967)
Line Coverage: 28.99% (81616/281564)
Region Coverage: 28.25% (42141/149158)
Branch Coverage: 24.83% (21379/86106)
Coverage Report: http://coverage.selectdb-in.cc/coverage/b36415796af4881788f26f82f81791a16e8e4608_b36415796af4881788f26f82f81791a16e8e4608/report/index.html

@zhiqiang-hhhh
Contributor Author

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.82% (9821/25970)
Line Coverage: 28.97% (81596/281672)
Region Coverage: 28.24% (42131/149203)
Branch Coverage: 24.82% (21381/86140)
Coverage Report: http://coverage.selectdb-in.cc/coverage/7eb8642f9a587895a5c757f61d356507495daed1_7eb8642f9a587895a5c757f61d356507495daed1/report/index.html

@zhiqiang-hhhh
Contributor Author

image

The test failure is unrelated to this PR.

@zhiqiang-hhhh zhiqiang-hhhh marked this pull request as ready for review November 4, 2024 12:51
wangbo previously approved these changes Nov 6, 2024
Contributor

@wangbo wangbo left a comment

LGTM

@github-actions github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) Nov 6, 2024
Contributor

github-actions bot commented Nov 6, 2024

PR approved by at least one committer and no changes requested.

Contributor

github-actions bot commented Nov 6, 2024

PR approved by anyone and no changes requested.

Contributor

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

be/src/util/interval_histogram.h (resolved)
@zhiqiang-hhhh
Contributor Author

run buildall

@zhiqiang-hhhh zhiqiang-hhhh changed the title from "[opt](metrics) More metrics for scanner" to "[feat](metrics) Unify metrics of thread pool" Dec 17, 2024
@zhiqiang-hhhh
Contributor Author

run buildall

@zhiqiang-hhhh zhiqiang-hhhh marked this pull request as ready for review December 19, 2024 12:29
@zhiqiang-hhhh
Contributor Author

run buildall

@zhiqiang-hhhh
Contributor Author

run buildall

@zhiqiang-hhhh
Contributor Author

run buildall

@zhiqiang-hhhh
Contributor Author

run buildall

#include "util/thread.h"

namespace doris {

// The names of these variables will be used as metric names in Prometheus.
DEFINE_GAUGE_METRIC_PROTOTYPE_2ARG(thread_pool_running_tasks, MetricUnit::NOUNIT);
Contributor

How do we distinguish the names of different thread pools?

Contributor Author

They are distinguished by label.
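
To make the label approach concrete: in the Prometheus text exposition format, one metric name can carry several series that differ only in their labels. The helper below is a rough sketch of what such output lines look like. The function name `format_gauge` and the label key `thread_pool_name` are assumptions for illustration, not names confirmed by this PR, and label-value escaping is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: renders one gauge sample in Prometheus text format,
// using a label (assumed key: thread_pool_name) to identify the pool.
// Note: does not escape quotes, backslashes, or newlines in pool_name.
std::string format_gauge(const std::string& metric, const std::string& pool_name,
                         int64_t value) {
    assert(!metric.empty());
    return metric + "{thread_pool_name=\"" + pool_name + "\"} " + std::to_string(value);
}
```

Two pools would then share the metric name `thread_pool_queue_size` while remaining separate series, each selectable by its label value in a query.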

@github-actions github-actions bot removed the "approved" label (Indicates a PR has been approved by one committer.) Dec 20, 2024
@zhiqiang-hhhh
Contributor Author

run buildall

@zhiqiang-hhhh
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 40126 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 614c0c91c124b261b20f5260a91fc84904ba8980, data reload: false

------ Round 1 ----------------------------------
q1	17583	7598	7332	7332
q2	2059	178	170	170
q3	10550	1135	1172	1135
q4	10227	734	723	723
q5	7581	2768	2742	2742
q6	240	147	146	146
q7	999	626	603	603
q8	9253	1920	1940	1920
q9	6755	6527	6424	6424
q10	6984	2266	2310	2266
q11	462	267	263	263
q12	428	220	217	217
q13	17771	2979	2882	2882
q14	252	214	209	209
q15	554	510	494	494
q16	690	606	600	600
q17	988	592	535	535
q18	7512	6713	6762	6713
q19	1336	1024	1017	1017
q20	492	180	186	180
q21	4096	3264	3237	3237
q22	392	338	318	318
Total cold run time: 107204 ms
Total hot run time: 40126 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7211	7217	7218	7217
q2	325	230	222	222
q3	2888	2805	3114	2805
q4	2133	1867	1838	1838
q5	5692	5698	5734	5698
q6	223	138	137	137
q7	2230	1794	1848	1794
q8	3421	3532	3515	3515
q9	9017	8915	9030	8915
q10	3605	3569	3556	3556
q11	607	500	503	500
q12	806	610	661	610
q13	11666	3073	3111	3073
q14	297	264	271	264
q15	549	518	525	518
q16	702	628	625	625
q17	1802	1611	1549	1549
q18	7894	7537	7356	7356
q19	1696	1524	1533	1524
q20	2035	1802	1810	1802
q21	5400	5382	5277	5277
q22	651	595	592	592
Total cold run time: 70850 ms
Total hot run time: 59387 ms

@doris-robot

TPC-DS: Total hot run time: 189275 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 614c0c91c124b261b20f5260a91fc84904ba8980, data reload: false

query1	963	417	389	389
query2	6520	2335	2370	2335
query3	6705	219	208	208
query4	33788	23286	23345	23286
query5	4252	471	478	471
query6	294	225	179	179
query7	4648	311	321	311
query8	350	241	251	241
query9	9563	2691	2683	2683
query10	458	257	238	238
query11	17918	15200	15113	15113
query12	157	101	105	101
query13	1665	410	393	393
query14	10544	6630	6634	6630
query15	259	188	185	185
query16	8116	430	449	430
query17	1577	568	539	539
query18	2153	294	288	288
query19	352	158	156	156
query20	123	112	113	112
query21	210	137	100	100
query22	4592	4247	4364	4247
query23	34593	34640	33426	33426
query24	11363	2481	2478	2478
query25	707	385	393	385
query26	1870	146	148	146
query27	2917	323	325	323
query28	8223	2416	2417	2416
query29	1045	401	401	401
query30	312	144	150	144
query31	1054	808	800	800
query32	93	60	59	59
query33	770	295	292	292
query34	1010	520	519	519
query35	843	711	778	711
query36	1113	937	927	927
query37	280	75	76	75
query38	4145	4235	4038	4038
query39	1481	1496	1390	1390
query40	292	100	100	100
query41	53	45	48	45
query42	119	100	100	100
query43	525	490	486	486
query44	1291	795	833	795
query45	191	165	166	165
query46	1150	717	700	700
query47	1944	1813	1845	1813
query48	411	323	313	313
query49	1231	387	406	387
query50	809	379	378	378
query51	7128	7063	7060	7060
query52	100	90	89	89
query53	253	183	192	183
query54	1206	395	402	395
query55	89	78	75	75
query56	261	238	252	238
query57	1283	1117	1093	1093
query58	259	228	239	228
query59	3264	2985	2873	2873
query60	284	279	244	244
query61	116	107	104	104
query62	879	654	677	654
query63	216	182	196	182
query64	5022	678	641	641
query65	3245	3214	3252	3214
query66	1185	303	308	303
query67	15753	15524	15466	15466
query68	5943	541	541	541
query69	430	265	255	255
query70	1190	1150	1213	1150
query71	330	254	260	254
query72	6533	4250	4279	4250
query73	758	365	356	356
query74	9975	8835	8938	8835
query75	3452	2625	2668	2625
query76	3609	1048	1105	1048
query77	511	271	280	271
query78	10262	9390	9390	9390
query79	2331	607	595	595
query80	1126	417	431	417
query81	525	236	237	236
query82	625	116	121	116
query83	238	146	160	146
query84	230	68	70	68
query85	1667	303	297	297
query86	492	299	268	268
query87	4600	4335	4373	4335
query88	3921	2219	2194	2194
query89	408	292	302	292
query90	2054	185	186	185
query91	143	104	109	104
query92	60	51	53	51
query93	2204	542	537	537
query94	741	287	248	248
query95	362	249	275	249
query96	618	285	291	285
query97	2819	2707	2702	2702
query98	219	191	199	191
query99	1553	1365	1295	1295
Total cold run time: 305514 ms
Total hot run time: 189275 ms

@doris-robot

ClickBench: Total hot run time: 32.99 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 614c0c91c124b261b20f5260a91fc84904ba8980, data reload: false

query1	0.03	0.04	0.03
query2	0.07	0.03	0.03
query3	0.24	0.07	0.07
query4	1.60	0.11	0.10
query5	0.44	0.40	0.41
query6	1.15	0.66	0.65
query7	0.02	0.02	0.02
query8	0.04	0.03	0.03
query9	0.58	0.51	0.50
query10	0.56	0.60	0.55
query11	0.15	0.11	0.11
query12	0.14	0.10	0.12
query13	0.60	0.60	0.60
query14	2.84	2.74	2.74
query15	0.90	0.83	0.82
query16	0.38	0.38	0.36
query17	1.03	1.04	1.01
query18	0.23	0.21	0.20
query19	1.94	1.88	2.05
query20	0.01	0.01	0.01
query21	15.36	0.60	0.59
query22	2.53	1.59	1.98
query23	16.83	1.20	0.81
query24	3.55	1.68	1.46
query25	0.27	0.16	0.08
query26	0.59	0.14	0.14
query27	0.05	0.03	0.04
query28	9.74	1.11	1.07
query29	12.59	3.22	3.20
query30	0.24	0.07	0.06
query31	2.85	0.38	0.38
query32	3.28	0.47	0.47
query33	3.16	3.23	3.24
query34	16.98	4.47	4.46
query35	4.50	4.43	4.43
query36	0.70	0.48	0.51
query37	0.10	0.06	0.06
query38	0.04	0.03	0.03
query39	0.04	0.02	0.02
query40	0.18	0.13	0.12
query41	0.08	0.03	0.02
query42	0.04	0.02	0.02
query43	0.04	0.03	0.03
Total cold run time: 106.69 s
Total hot run time: 32.99 s

@@ -139,7 +139,6 @@ namespace doris {
class PBackendService_Stub;
class PFunctionService_Stub;

DEFINE_GAUGE_METRIC_PROTOTYPE_2ARG(scanner_thread_pool_queue_size, MetricUnit::NOUNIT);
DEFINE_GAUGE_METRIC_PROTOTYPE_2ARG(send_batch_thread_pool_thread_num, MetricUnit::NOUNIT);
Contributor

The metrics related to _send_batch_thread_pool can be migrated directly into the generic thread pool metrics; there is no need to preserve compatibility.

#include "gutil/integral_types.h"

namespace doris {

Contributor

This class needs unit tests.

@@ -269,10 +286,35 @@ Status ThreadPool::init() {
return status;
}
}

_metric_entity = DorisMetrics::instance()->metric_registry()->register_entity(
Contributor

If there are two workload groups and each has a scan thread pool, the entity is registered twice with the same pool name. What is the behavior in that case?

Contributor

When one workload group is deleted and deregister is called, removing this entity, what happens to the other workload group?

Contributor Author

The pool names will not be the same; each pool name is suffixed with the name of its workload group.

// Get the next token and task to execute.
ThreadPoolToken* token = _queue.front();
_queue.pop_front();
DCHECK_EQ(ThreadPoolToken::State::RUNNING, token->state());
DCHECK(!token->_entries.empty());
Task task = std::move(token->_entries.front());
std::chrono::time_point<std::chrono::system_clock> current =
Contributor

This method may perform poorly.
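
One option that could be weighed here, offered only as a sketch: measure the queue wait with `std::chrono::steady_clock`, which is monotonic and so unaffected by wall-clock adjustments. Whether it is any cheaper than `system_clock` depends on the platform and would need to be measured. `QueuedTask`, `make_task`, and `wait_ns` are hypothetical names, not the PR's code.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical task wrapper that remembers when it was queued.
struct QueuedTask {
    std::chrono::steady_clock::time_point enqueued_at;
};

// Stamp the task at submission time.
inline QueuedTask make_task() {
    return QueuedTask{std::chrono::steady_clock::now()};
}

// Nanoseconds the task spent waiting, given the time a worker picked it up.
inline int64_t wait_ns(const QueuedTask& task,
                       std::chrono::steady_clock::time_point now) {
    assert(now >= task.enqueued_at); // holds for a monotonic clock
    return std::chrono::duration_cast<std::chrono::nanoseconds>(now - task.enqueued_at)
            .count();
}
```

A worker would call `wait_ns(task, std::chrono::steady_clock::now())` once when it dequeues the task, so the cost is one clock read at submit and one at dequeue.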

return Status::OK();
}

void ThreadPool::shutdown() {
DorisMetrics::instance()->metric_registry()->deregister_entity(_metric_entity);
Contributor

If we deregister first while a task is still running, could updating the metric after that task finishes corrupt memory?

Contributor Author

No. The metrics are not updated by the thread pool; the update is triggered when an external system requests the metrics. After deregister, these metrics are no longer in the map.
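
The pull-based update described in this reply can be sketched roughly as follows. This is a simplified model under stated assumptions, not Doris's `MetricRegistry`: `HookRegistry`, `collect`, and the string-keyed map are all hypothetical. The point it illustrates is that a hook only runs during a scrape, so once an entity is erased from the map nothing reads the pool's state on its behalf.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical registry: each entity supplies a hook that computes its
// current value only when a scrape asks for it.
class HookRegistry {
public:
    void register_entity(const std::string& name, std::function<int64_t()> hook) {
        assert(hook != nullptr);
        std::lock_guard<std::mutex> guard(_lock);
        _hooks[name] = std::move(hook);
    }

    // After this returns, the entity's hook is never invoked again.
    void deregister_entity(const std::string& name) {
        std::lock_guard<std::mutex> guard(_lock);
        _hooks.erase(name);
    }

    // Called when an external system requests metrics: run every hook.
    std::map<std::string, int64_t> collect() const {
        std::lock_guard<std::mutex> guard(_lock);
        std::map<std::string, int64_t> snapshot;
        for (const auto& entry : _hooks) {
            snapshot[entry.first] = entry.second();
        }
        return snapshot;
    }

private:
    mutable std::mutex _lock;
    std::map<std::string, std::function<int64_t()>> _hooks;
};
```

In this model, worker threads never write to metric storage, so a task finishing after `deregister_entity` has nothing to update. The sketch also holds the lock across `collect`, so a deregistration cannot interleave with a hook that is already running.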

@zhiqiang-hhhh
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 39916 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3d0859e06809be66920e7651a4fe6b505e50f751, data reload: false

------ Round 1 ----------------------------------
q1	17590	7413	7282	7282
q2	2051	185	168	168
q3	10578	1116	1138	1116
q4	10478	733	770	733
q5	7616	2712	2729	2712
q6	241	153	148	148
q7	1020	647	599	599
q8	9255	1899	1924	1899
q9	6699	6487	6470	6470
q10	7039	2375	2323	2323
q11	470	269	261	261
q12	437	223	218	218
q13	17764	2977	2911	2911
q14	240	220	206	206
q15	552	500	501	500
q16	648	587	585	585
q17	1001	557	618	557
q18	7335	6587	6812	6587
q19	1348	960	947	947
q20	489	184	192	184
q21	4034	3273	3206	3206
q22	373	315	304	304
Total cold run time: 107258 ms
Total hot run time: 39916 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7272	7248	7232	7232
q2	330	238	232	232
q3	2903	2828	2978	2828
q4	2053	1832	1816	1816
q5	5677	5688	5630	5630
q6	238	144	148	144
q7	2280	1798	1841	1798
q8	3392	3538	3509	3509
q9	8973	8967	9026	8967
q10	3627	3608	3552	3552
q11	610	520	509	509
q12	826	622	605	605
q13	12033	3095	3095	3095
q14	311	274	286	274
q15	570	537	516	516
q16	693	645	658	645
q17	1891	1655	1624	1624
q18	8356	7899	7834	7834
q19	1778	1614	1605	1605
q20	2094	1879	1920	1879
q21	5677	5416	5416	5416
q22	643	592	579	579
Total cold run time: 72227 ms
Total hot run time: 60289 ms

@doris-robot

TPC-DS: Total hot run time: 198366 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 3d0859e06809be66920e7651a4fe6b505e50f751, data reload: false

query1	1292	935	981	935
query2	6240	2366	2449	2366
query3	11116	4851	4898	4851
query4	32721	23481	23697	23481
query5	4016	470	463	463
query6	284	201	218	201
query7	3983	311	308	308
query8	293	251	236	236
query9	9369	2753	2755	2753
query10	471	238	249	238
query11	17998	15574	15381	15381
query12	158	110	107	107
query13	1578	426	422	422
query14	9796	7302	8014	7302
query15	282	197	197	197
query16	8341	514	517	514
query17	1791	638	660	638
query18	2170	327	338	327
query19	376	172	168	168
query20	127	116	116	116
query21	215	112	114	112
query22	4810	4671	4478	4478
query23	35023	33812	33555	33555
query24	10569	2564	2472	2472
query25	621	395	404	395
query26	1216	165	153	153
query27	2746	346	337	337
query28	7852	2524	2505	2505
query29	847	425	420	420
query30	232	151	153	151
query31	1037	842	853	842
query32	104	65	54	54
query33	745	332	305	305
query34	983	530	522	522
query35	929	772	786	772
query36	1149	954	960	954
query37	133	70	73	70
query38	4344	4179	4337	4179
query39	1547	1545	1478	1478
query40	212	110	101	101
query41	45	46	44	44
query42	122	105	103	103
query43	554	509	508	508
query44	1291	851	818	818
query45	193	168	174	168
query46	1192	734	720	720
query47	2068	1925	1985	1925
query48	441	331	339	331
query49	1003	384	387	384
query50	844	400	394	394
query51	7350	7206	7348	7206
query52	105	91	97	91
query53	263	183	187	183
query54	1182	413	431	413
query55	79	77	81	77
query56	270	265	252	252
query57	1311	1181	1200	1181
query58	244	217	220	217
query59	3439	3211	3160	3160
query60	286	250	257	250
query61	144	107	125	107
query62	912	745	759	745
query63	225	190	193	190
query64	3944	717	663	663
query65	3318	3267	3311	3267
query66	781	309	321	309
query67	16397	15611	15553	15553
query68	5087	555	573	555
query69	475	256	263	256
query70	1231	1126	1143	1126
query71	464	247	260	247
query72	6448	4150	4176	4150
query73	780	374	363	363
query74	9956	8867	8976	8867
query75	3472	2721	2710	2710
query76	3741	1167	1081	1081
query77	666	347	300	300
query78	10479	9549	9686	9549
query79	2005	626	620	620
query80	1078	428	440	428
query81	518	244	229	229
query82	642	120	126	120
query83	205	147	143	143
query84	284	74	135	74
query85	1428	320	314	314
query86	439	307	297	297
query87	4426	4531	4310	4310
query88	4195	2233	2227	2227
query89	432	291	292	291
query90	2068	191	189	189
query91	138	104	104	104
query92	66	50	54	50
query93	2082	553	546	546
query94	867	293	256	256
query95	354	258	252	252
query96	644	277	284	277
query97	2884	2692	2706	2692
query98	219	195	200	195
query99	1727	1470	1430	1430
Total cold run time: 305219 ms
Total hot run time: 198366 ms

@doris-robot

ClickBench: Total hot run time: 32.67 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 3d0859e06809be66920e7651a4fe6b505e50f751, data reload: false

query1	0.03	0.03	0.03
query2	0.08	0.04	0.03
query3	0.23	0.07	0.06
query4	1.61	0.11	0.10
query5	0.43	0.42	0.39
query6	1.15	0.67	0.65
query7	0.02	0.02	0.02
query8	0.04	0.03	0.03
query9	0.59	0.50	0.51
query10	0.55	0.57	0.57
query11	0.16	0.11	0.10
query12	0.13	0.11	0.12
query13	0.61	0.61	0.59
query14	2.72	2.76	2.82
query15	0.91	0.82	0.82
query16	0.37	0.39	0.39
query17	0.99	1.05	1.03
query18	0.24	0.22	0.20
query19	1.97	1.91	2.02
query20	0.01	0.01	0.01
query21	15.36	0.59	0.59
query22	2.70	2.11	1.84
query23	17.02	1.02	0.90
query24	3.28	2.06	0.79
query25	0.21	0.16	0.08
query26	0.58	0.14	0.14
query27	0.04	0.04	0.04
query28	9.96	1.10	1.06
query29	12.61	3.18	3.20
query30	0.24	0.06	0.06
query31	2.85	0.40	0.38
query32	3.25	0.47	0.46
query33	3.16	3.08	3.11
query34	17.23	4.48	4.52
query35	4.51	4.52	4.58
query36	0.68	0.48	0.47
query37	0.10	0.06	0.06
query38	0.04	0.03	0.04
query39	0.03	0.02	0.03
query40	0.17	0.13	0.12
query41	0.08	0.03	0.02
query42	0.04	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 107.01 s
Total hot run time: 32.67 s
