
Investigation: Collect "mean instruction count" metric #689

Open
mdboom opened this issue Jun 26, 2024 · 9 comments
@mdboom
Contributor

mdboom commented Jun 26, 2024

As suggested by @markshannon, it could be useful to know, for each benchmark, the mean number of times each specific instruction is executed, in order to better characterize the benchmarks. (Here "specific instruction" means a unique location within a code object, not a bytecode instruction type.)

The metric is:

  1. count the number of times each specific instruction is executed
  2. sum all of these numbers up and divide by the total number of specific instructions that were executed at least once

This, roughly speaking, gives an idea of how "loopy" code is.
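The two steps can be sketched in Python with hypothetical counts (real data comes from the instrumentation described in the methodology below):

```python
# Sketch of the metric with hypothetical counts: each key is a
# "specific instruction" (code object, offset), each value is how
# many times it was executed.
counts = {("f", 0): 100, ("f", 2): 100, ("g", 0): 1}

# Step 2: total executions divided by the number of distinct
# instructions executed at least once.
mean_instr_count = sum(counts.values()) // len(counts)
instr_count = len(counts)
print(mean_instr_count, instr_count)  # 67 3
```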

Methodology

This uses sys.monitoring for measurement, using a plugin for pyperf that turns on the measurement only around the actual benchmarking code. (The concept of a pyperf plugin exists only in a pull request at the moment).

The data was collected by running with pyperf's --debug-single-value, which runs the benchmark code exactly once (the inner loop only once, and the outer loop that spawns individual processes exactly once). This is partly to reveal which benchmarks have no loops at all, and also because the instrumentation is very slow, so as a practical matter running as little as possible helps get results in a reasonable time (it takes about 3 hours on my laptop, atm).

from collections import defaultdict
from sys import monitoring


class instruction_count:
    def __init__(self):
        # Called at startup; the tool id must be in use before a
        # callback can be registered for it
        self.instructions = defaultdict(int)
        monitoring.use_tool_id(1, "instruction_count")
        monitoring.register_callback(
            1, monitoring.events.INSTRUCTION, self.event_handler
        )

    def collect_metadata(self, metadata):
        # Called at shutdown
        value = sum(self.instructions.values()) // len(self.instructions)
        metadata["mean_instr_count"] = value
        metadata["instr_count"] = len(self.instructions)

    def event_handler(self, code, instruction_offset):
        # The sys.monitoring INSTRUCTION event handler
        self.instructions[(code, instruction_offset)] += 1

    def __enter__(self):
        # Called before running benchmarking code
        monitoring.set_events(1, monitoring.events.INSTRUCTION)

    def __exit__(self, _exc_type, _exc_value, _traceback):
        # Called after running benchmarking code
        monitoring.set_events(1, 0)

Results

Results, sorted by mean instruction count
name mean instruction count instruction count
deepcopy_reduce 1 384
logging_silent 1 276
pickle 1 260
pickle_dict 1 164
pickle_list 1 192
python_startup 1 1197
python_startup_no_site 1 1197
sqlite_synth 1 322
unpack_sequence 1 4960
unpickle 1 261
unpickle_list 1 185
aiohttp 2 10437
flaskblogging 2 10443
gunicorn 2 10441
djangocms 3 12340
logging_format 8 969
logging_simple 8 921
bench_thread_pool 9 5776
comprehensions 12 418
json_loads 15 302
asyncio_websockets 16 20083
bench_mp_pool 18 5551
deepcopy_memo 27 333
thrift 29 1546
regex_dna 33 2042
typing_runtime_protocols 47 596
regex_effbot 51 3128
regex_v8 55 15982
sympy_integrate 100 27549
deepcopy 105 558
sqlalchemy_imperative 117 6826
sqlalchemy_declarative 245 24749
tornado_http 245 18026
create_gc_cycles 277 220
sympy_sum 290 52200
json 304 294
dask 390 53408
pathlib 438 2580
unpickle_pure_python 467 1869
asyncio_tcp_ssl 473 12662
coverage 483 6329
mako 501 3412
deltablue 504 1394
html5lib 550 12352
pylint 551 106975
genshi_text 570 7044
dulwich_log 696 4295
pickle_pure_python 716 1305
hexiom 800 1718
xml_etree_parse 891 3230
asyncio_tcp 988 5015
genshi_xml 1051 7436
sympy_str 1105 19761
telco 1255 285
async_generators 1388 19193
json_dumps 1853 244
chameleon 1892 684
docutils 1994 64996
sympy_expand 2073 15964
xml_etree_iterparse 2096 3439
xml_etree_process 2186 3860
xml_etree_generate 2991 3171
mypy2 3338 98633
pidigits 3499 262
scimark_sparse_mat_mult 4978 271
regex_compile 5028 3840
django_template 5268 864
pycparser 8694 12235
richards 9794 927
richards_super 10174 970
crypto_pyaes 10197 982
chaos 12955 888
go 18550 1511
gc_traversal 20066 175
coroutines 21939 166
meteor_contest 28833 303
scimark_lu 29979 646
nqueens 34722 359
raytrace 35010 1228
scimark_monte_carlo 40481 358
float 44642 280
generators 52287 222
pyflate 57937 1328
scimark_fft 64011 820
nbody 64806 458
mdp 70366 1743
spectral_norm 79688 273
scimark_sor 100342 293
bpe_tokeniser 162137 3128
pprint_safe_repr 268750 416
fannkuch 271393 272
tomli_loads 286966 1611
pprint_pformat 360345 638
Results, sorted by absolute instruction count
name mean instruction count instruction count
pickle_dict 1 164
coroutines 21939 166
gc_traversal 20066 175
unpickle_list 1 185
pickle_list 1 192
create_gc_cycles 277 220
generators 52287 222
json_dumps 1853 244
pickle 1 260
unpickle 1 261
pidigits 3499 262
scimark_sparse_mat_mult 4978 271
fannkuch 271393 272
spectral_norm 79688 273
logging_silent 1 276
float 44642 280
telco 1255 285
scimark_sor 100342 293
json 304 294
json_loads 15 302
meteor_contest 28833 303
sqlite_synth 1 322
deepcopy_memo 27 333
scimark_monte_carlo 40481 358
nqueens 34722 359
deepcopy_reduce 1 384
pprint_safe_repr 268750 416
comprehensions 12 418
nbody 64806 458
deepcopy 105 558
typing_runtime_protocols 47 596
pprint_pformat 360345 638
scimark_lu 29979 646
chameleon 1892 684
scimark_fft 64011 820
django_template 5268 864
chaos 12955 888
logging_simple 8 921
richards 9794 927
logging_format 8 969
richards_super 10174 970
crypto_pyaes 10197 982
python_startup 1 1197
python_startup_no_site 1 1197
raytrace 35010 1228
pickle_pure_python 716 1305
pyflate 57937 1328
deltablue 504 1394
go 18550 1511
thrift 29 1546
tomli_loads 286966 1611
hexiom 800 1718
mdp 70366 1743
unpickle_pure_python 467 1869
regex_dna 33 2042
pathlib 438 2580
regex_effbot 51 3128
bpe_tokeniser 162137 3128
xml_etree_generate 2991 3171
xml_etree_parse 891 3230
mako 501 3412
xml_etree_iterparse 2096 3439
regex_compile 5028 3840
xml_etree_process 2186 3860
dulwich_log 696 4295
unpack_sequence 1 4960
asyncio_tcp 988 5015
bench_mp_pool 18 5551
bench_thread_pool 9 5776
coverage 483 6329
sqlalchemy_imperative 117 6826
genshi_text 570 7044
genshi_xml 1051 7436
aiohttp 2 10437
gunicorn 2 10441
flaskblogging 2 10443
pycparser 8694 12235
djangocms 3 12340
html5lib 550 12352
asyncio_tcp_ssl 473 12662
sympy_expand 2073 15964
regex_v8 55 15982
tornado_http 245 18026
async_generators 1388 19193
sympy_str 1105 19761
asyncio_websockets 16 20083
sqlalchemy_declarative 245 24749
sympy_integrate 100 27549
sympy_sum 290 52200
dask 390 53408
docutils 1994 64996
mypy2 3338 98633
pylint 551 106975

A few conclusions to draw from this:

There are a few benchmarks where the mean instruction count is 1, or very low, where there's not much an optimizer could do (short of multiple loops). Some of these, at least pickle*, unpickle*, and sqlite_synth, aren't really CPython interpreter benchmarks at all and just drop to C code pretty quickly (the profiling results confirm this). flaskblogging, gunicorn, and djangocms are at least ostensibly macrobenchmarks, so we should investigate why the mean instruction count is so low there. We should consider excluding this whole class of benchmarks from the global number, at least for the purposes of optimizing the interpreter.

Benchmarks with both a high mean instruction count and a high total instruction count feel like really robust examples of real-world code. These include pylint, mypy2, docutils, dask, sympy*, and pycparser.

@markshannon
Member

If the aim is to count the number of unique instructions executed per process, the event_handler should return sys.monitoring.DISABLE so that it sees each instruction only once; that way there is no need for a mapping.

    def event_handler(self, code, instruction_offset):
        # The sys.monitoring INSTRUCTION event handler
        self.count += 1
        return monitoring.DISABLE

That should allow subsequent iterations to run at close to full speed as well.

@markshannon
Member

markshannon commented Jun 28, 2024

If you need the total instruction count as well, then a stats build is probably the fastest.
Alternatively, use two different tool IDs: one to count the total unique instructions, as above, and another to count the total instructions executed:

    def event_handler2(self, code, instruction_offset):
        self.count_executed += 1
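A self-contained sketch of how the two counters combine to give the mean. The event delivery is simulated here rather than driven by sys.monitoring (so the example runs anywhere); the class and variable names are illustrative, not the actual implementation:

```python
class InstructionCount:
    # Sketch of the two-tool-ID idea: one handler counts unique
    # instructions and disables further events for that location,
    # the other counts every execution.
    DISABLE = object()  # stands in for sys.monitoring.DISABLE

    def __init__(self):
        self.count_unique = 0
        self.count_executed = 0

    def unique_handler(self, code, instruction_offset):
        # Fires once per specific instruction, then disables itself.
        self.count_unique += 1
        return self.DISABLE

    def executed_handler(self, code, instruction_offset):
        # Fires on every instruction execution.
        self.count_executed += 1


# Simulate a loop executing the same two instructions five times.
ic = InstructionCount()
seen = set()
for _ in range(5):
    for offset in (0, 2):
        if ("code", offset) not in seen:  # sys.monitoring stops delivering
            seen.add(("code", offset))    # events after DISABLE is returned
            ic.unique_handler("code", offset)
        ic.executed_handler("code", offset)

print(ic.count_executed // ic.count_unique)  # mean instruction count: 5
```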

@mdboom
Contributor Author

mdboom commented Jul 1, 2024

I see now that my confusion was over how monitoring.DISABLE works, and that your suggestion is just to improve the performance of the measurement, not to fix any problem with the definition of the metric itself.

Given that, any comments on the results?

@markshannon
Member

Just to check that I'm reading this right:
The numbers for spectral_norm, 79688 and 273, mean that only 273 unique instructions are executed, but each of those is executed 80k times, for a total of over 20 million bytecodes executed?

It is indeed puzzling that the numbers for mean instruction count are so low for so many benchmarks.

Before investigating further, it might be worth checking that the numbers stay the same using the faster approach I outlined, as a sanity check.

@mdboom
Contributor Author

mdboom commented Jul 8, 2024

The numbers for spectral_norm, 79688 and 273, mean that only 273 unique instructions are executed, but each of those is executed 80k times, for a total of over 20 million bytecodes executed?

Each was executed an average of 80k times, so some probably only once, and some much more than that. Collecting a histogram of execution counts could be interesting (but would require more data collected, of course).
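Such a histogram could be bucketed by order of magnitude from the per-instruction counts. A sketch with hypothetical data (real keys would be (code object, offset) pairs from the INSTRUCTION callback):

```python
from collections import Counter

# Hypothetical per-instruction execution counts.
instructions = {("f", 0): 1, ("f", 2): 80000, ("f", 4): 80000, ("g", 0): 3}

# Histogram of execution counts, bucketed by number of decimal digits
# (i.e. roughly order of magnitude).
histogram = Counter(len(str(count)) for count in instructions.values())
print(sorted(histogram.items()))  # [(1, 2), (5, 2)]
```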

Yeah, I'll try again with the faster approach and see how the numbers compare.

@mdboom
Contributor Author

mdboom commented Jul 8, 2024

Some additional findings (from discussion with @markshannon):

Some of these benchmarks start a web server in a separate process, and then just make HTTP requests against it from the main process. (flaskblogging, djangocms, aiohttp, others). Here, the total instruction count and the mean will only capture the main process, which is mostly just waiting on IO, not running Python code. Changes could be made to capture that "real" work if it's useful. This may also be affecting our pystats collection -- I will confirm one way or the other.

Likewise, python_startup_* just times a subprocess, and that subprocess doesn't collect any of this data.

@mdboom
Contributor Author

mdboom commented Jul 9, 2024

Using @markshannon's suggested performance improvements, the results are basically the same.

Results, sorted by mean instruction count
name mean instruction count instruction count
deepcopy_reduce 1 366
logging_silent 1 258
pickle_dict 1 146
pickle_list 1 174
python_startup 1 1179
python_startup_no_site 1 1179
sqlite_synth 1 304
unpack_sequence 1 4942
unpickle 1 243
unpickle_list 1 167
aiohttp 2 10419
flaskblogging 2 10425
gunicorn 2 10423
pickle 2 242
djangocms 3 12322
logging_format 8 951
logging_simple 8 903
bench_thread_pool 9 5725
comprehensions 13 400
asyncio_websockets 16 20065
json_loads 16 284
deepcopy_memo 28 315
thrift 29 1528
regex_dna 33 2024
typing_runtime_protocols 47 578
sqlglot_transpile 49 4515
regex_effbot 52 3110
regex_v8 55 15964
sqlglot_parse 71 2794
bench_mp_pool 91 5630
sympy_integrate 101 27531
deepcopy 108 540
sqlalchemy_imperative 117 6808
sqlalchemy_declarative 245 24766
tornado_http 245 18041
sympy_sum 285 53368
create_gc_cycles 301 202
json 324 276
dask 333 52986
pathlib 441 2562
unpickle_pure_python 472 1851
asyncio_tcp_ssl 473 12651
coverage 484 6311
mako 503 3395
deltablue 511 1376
html5lib 551 12334
pylint 551 106961
genshi_text 571 7026
sqlglot_optimize 597 8617
dulwich_log 699 4277
pickle_pure_python 726 1287
hexiom 809 1700
xml_etree_parse 896 3212
asyncio_tcp 992 4997
genshi_xml 1054 7418
sympy_str 1098 19743
mypy2 1119 77867
telco 1340 267
async_generators 1390 19175
chameleon 1943 666
docutils 1993 65031
json_dumps 2001 226
sympy_expand 2079 15946
xml_etree_iterparse 2107 3421
xml_etree_process 2196 3842
xml_etree_generate 3008 3153
sqlglot_normalize 3392 3286
pidigits 3757 244
async_tree_eager_tg 4369 1944
regex_compile 5052 3822
scimark_sparse_mat_mult 5333 253
django_template 5380 846
async_tree_eager_cpu_io_mixed_tg 5754 2370
async_tree_eager 6441 1969
async_tree_eager_memoization_tg 7286 2348
async_tree_eager_cpu_io_mixed 7479 2348
pycparser 8692 12237
async_tree_eager_memoization 8974 2326
richards 9988 909
richards_super 10366 952
crypto_pyaes 10388 964
async_tree_cpu_io_mixed_tg 11849 2275
async_tree_none_tg 11978 1940
async_tree_memoization_tg 13104 2253
chaos 13223 870
async_tree_cpu_io_mixed 13239 2244
async_tree_none 13657 1903
async_tree_memoization 14526 2222
go 18773 1493
async_tree_eager_io_tg 20297 2248
async_tree_eager_io 21793 2220
gc_traversal 22367 157
async_tree_io_tg 23148 2184
coroutines 24607 148
async_tree_io 24760 2153
meteor_contest 30654 285
scimark_lu 30838 628
raytrace 35531 1210
nqueens 36555 341
scimark_monte_carlo 42624 340
float 47710 262
generators 56900 204
pyflate 58733 1310
scimark_fft 65448 802
nbody 67458 440
mdp 71100 1725
spectral_norm 85313 255
scimark_sor 106910 275
bpe_tokeniser 163075 3110
pprint_safe_repr 280905 398
tomli_loads 290209 1593
fannkuch 290626 254
pprint_pformat 370807 620
Results, sorted by instruction count
name mean instruction count instruction count
pickle_dict 1 146
coroutines 24607 148
gc_traversal 22367 157
unpickle_list 1 167
pickle_list 1 174
create_gc_cycles 301 202
generators 56900 204
json_dumps 2001 226
pickle 2 242
unpickle 1 243
pidigits 3757 244
scimark_sparse_mat_mult 5333 253
fannkuch 290626 254
spectral_norm 85313 255
logging_silent 1 258
float 47710 262
telco 1340 267
scimark_sor 106910 275
json 324 276
json_loads 16 284
meteor_contest 30654 285
sqlite_synth 1 304
deepcopy_memo 28 315
scimark_monte_carlo 42624 340
nqueens 36555 341
deepcopy_reduce 1 366
pprint_safe_repr 280905 398
comprehensions 13 400
nbody 67458 440
deepcopy 108 540
typing_runtime_protocols 47 578
pprint_pformat 370807 620
scimark_lu 30838 628
chameleon 1943 666
scimark_fft 65448 802
django_template 5380 846
chaos 13223 870
logging_simple 8 903
richards 9988 909
logging_format 8 951
richards_super 10366 952
crypto_pyaes 10388 964
python_startup 1 1179
python_startup_no_site 1 1179
raytrace 35531 1210
pickle_pure_python 726 1287
pyflate 58733 1310
deltablue 511 1376
go 18773 1493
thrift 29 1528
tomli_loads 290209 1593
hexiom 809 1700
mdp 71100 1725
unpickle_pure_python 472 1851
async_tree_none 13657 1903
async_tree_none_tg 11978 1940
async_tree_eager_tg 4369 1944
async_tree_eager 6441 1969
regex_dna 33 2024
async_tree_io 24760 2153
async_tree_io_tg 23148 2184
async_tree_eager_io 21793 2220
async_tree_memoization 14526 2222
async_tree_cpu_io_mixed 13239 2244
async_tree_eager_io_tg 20297 2248
async_tree_memoization_tg 13104 2253
async_tree_cpu_io_mixed_tg 11849 2275
async_tree_eager_memoization 8974 2326
async_tree_eager_memoization_tg 7286 2348
async_tree_eager_cpu_io_mixed 7479 2348
async_tree_eager_cpu_io_mixed_tg 5754 2370
pathlib 441 2562
sqlglot_parse 71 2794
regex_effbot 52 3110
bpe_tokeniser 163075 3110
xml_etree_generate 3008 3153
xml_etree_parse 896 3212
sqlglot_normalize 3392 3286
mako 503 3395
xml_etree_iterparse 2107 3421
regex_compile 5052 3822
xml_etree_process 2196 3842
dulwich_log 699 4277
sqlglot_transpile 49 4515
unpack_sequence 1 4942
asyncio_tcp 992 4997
bench_mp_pool 91 5630
bench_thread_pool 9 5725
coverage 484 6311
sqlalchemy_imperative 117 6808
genshi_text 571 7026
genshi_xml 1054 7418
sqlglot_optimize 597 8617
aiohttp 2 10419
gunicorn 2 10423
flaskblogging 2 10425
pycparser 8692 12237
djangocms 3 12322
html5lib 551 12334
asyncio_tcp_ssl 473 12651
sympy_expand 2079 15946
regex_v8 55 15964
tornado_http 245 18041
async_generators 1390 19175
sympy_str 1098 19743
asyncio_websockets 16 20065
sqlalchemy_declarative 245 24766
sympy_integrate 101 27531
dask 333 52986
sympy_sum 285 53368
docutils 1993 65031
mypy2 1119 77867
pylint 551 106961

@cmangla-overleaf

cmangla-overleaf commented Jul 12, 2024

The metric is:

1. count the number of times each specific instruction is executed

2. sum all of these numbers up and divide by the total number of specific instructions that were executed at least once

This, roughly speaking, gives an idea of how "loopy" code is.

I suggest another summary statistic. Take the counts in step 1 and calculate their Shannon entropy:

H(X) = -∑ₓ p(x) log p(x)

Sort the benchmarks by this value. More loopy code should be less "surprising", which is what the entropy should reflect.

For each benchmark, p(x) is the probability of instruction x, i.e., the count of instruction x in the benchmark divided by the total instruction count of the benchmark. So for each benchmark, the p(x) should sum to 1.
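A sketch of this statistic, with two made-up benchmarks: "loopy" concentrates executions on one instruction, "straight" executes each instruction exactly once (so its distribution is uniform and its entropy is maximal, log₂ 4 = 2 bits):

```python
import math

def entropy(counts):
    # Shannon entropy (bits) of the instruction-count distribution:
    # p(x) = count(x) / total, H = -sum of p(x) * log2(p(x))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

loopy = {"a": 97, "b": 1, "c": 1, "d": 1}     # work concentrated in a loop
straight = {"a": 1, "b": 1, "c": 1, "d": 1}   # straight-line code

print(entropy(loopy) < entropy(straight))  # True: loopy is less "surprising"
```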

@cmangla

cmangla commented Jul 21, 2024

@mdboom Could you post the raw data here? I am just curious because of my general interest in this sort of analysis.
