
Investigation: Collect "mean instruction count" metric #689

Open
mdboom opened this issue Jun 26, 2024 · 9 comments
@mdboom
Contributor

mdboom commented Jun 26, 2024

As suggested by @markshannon, it could be useful to know, for each benchmark, the mean number of times each specific instruction is executed, in order to better characterize the benchmarks. (Here "specific instruction" means a unique location within a code object, not a bytecode instruction type.)

The metric is:

  1. count the number of times each specific instruction is executed
  2. sum all of these numbers up and divide by the total number of specific instructions that were executed at least once

This, roughly speaking, gives an idea of how "loopy" code is.
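The two steps can be sketched in Python with hypothetical counts (real data comes from the instrumentation described in the methodology below):

```python
# Sketch of the metric with hypothetical counts: each key is a
# "specific instruction" (code object, offset), each value is how
# many times it was executed.
counts = {("f", 0): 100, ("f", 2): 100, ("g", 0): 1}

# Step 2: total executions divided by the number of distinct
# instructions executed at least once.
mean_instr_count = sum(counts.values()) // len(counts)
instr_count = len(counts)
print(mean_instr_count, instr_count)  # 67 3
```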

Methodology

This uses sys.monitoring for measurement, using a plugin for pyperf that turns on the measurement only around the actual benchmarking code. (The concept of a pyperf plugin exists only in a pull request at the moment).

The data was collected by running with pyperf's --debug-single-value, which runs the benchmark code exactly once (the inner loop only once, and the outer loop that spawns individual processes exactly once). This is partly to reveal which benchmarks have no loops at all, and also because the instrumentation is very slow, so as a practical matter running as little as possible helps get results in a reasonable time (it takes about 3 hours on my laptop, atm).

from collections import defaultdict
from sys import monitoring


class instruction_count:
    def __init__(self):
        # Called at startup; the tool id must be in use before a
        # callback can be registered for it
        self.instructions = defaultdict(int)
        monitoring.use_tool_id(1, "instruction_count")
        monitoring.register_callback(
            1, monitoring.events.INSTRUCTION, self.event_handler
        )

    def collect_metadata(self, metadata):
        # Called at shutdown
        value = sum(self.instructions.values()) // len(self.instructions)
        metadata["mean_instr_count"] = value
        metadata["instr_count"] = len(self.instructions)

    def event_handler(self, code, instruction_offset):
        # The sys.monitoring INSTRUCTION event handler
        self.instructions[(code, instruction_offset)] += 1

    def __enter__(self):
        # Called before running benchmarking code
        monitoring.set_events(1, monitoring.events.INSTRUCTION)

    def __exit__(self, _exc_type, _exc_value, _traceback):
        # Called after running benchmarking code
        monitoring.set_events(1, 0)

Results

Results, sorted by mean instruction count
name mean instruction count instruction count
deepcopy_reduce 1 384
logging_silent 1 276
pickle 1 260
pickle_dict 1 164
pickle_list 1 192
python_startup 1 1197
python_startup_no_site 1 1197
sqlite_synth 1 322
unpack_sequence 1 4960
unpickle 1 261
unpickle_list 1 185
aiohttp 2 10437
flaskblogging 2 10443
gunicorn 2 10441
djangocms 3 12340
logging_format 8 969
logging_simple 8 921
bench_thread_pool 9 5776
comprehensions 12 418
json_loads 15 302
asyncio_websockets 16 20083
bench_mp_pool 18 5551
deepcopy_memo 27 333
thrift 29 1546
regex_dna 33 2042
typing_runtime_protocols 47 596
regex_effbot 51 3128
regex_v8 55 15982
sympy_integrate 100 27549
deepcopy 105 558
sqlalchemy_imperative 117 6826
sqlalchemy_declarative 245 24749
tornado_http 245 18026
create_gc_cycles 277 220
sympy_sum 290 52200
json 304 294
dask 390 53408
pathlib 438 2580
unpickle_pure_python 467 1869
asyncio_tcp_ssl 473 12662
coverage 483 6329
mako 501 3412
deltablue 504 1394
html5lib 550 12352
pylint 551 106975
genshi_text 570 7044
dulwich_log 696 4295
pickle_pure_python 716 1305
hexiom 800 1718
xml_etree_parse 891 3230
asyncio_tcp 988 5015
genshi_xml 1051 7436
sympy_str 1105 19761
telco 1255 285
async_generators 1388 19193
json_dumps 1853 244
chameleon 1892 684
docutils 1994 64996
sympy_expand 2073 15964
xml_etree_iterparse 2096 3439
xml_etree_process 2186 3860
xml_etree_generate 2991 3171
mypy2 3338 98633
pidigits 3499 262
scimark_sparse_mat_mult 4978 271
regex_compile 5028 3840
django_template 5268 864
pycparser 8694 12235
richards 9794 927
richards_super 10174 970
crypto_pyaes 10197 982
chaos 12955 888
go 18550 1511
gc_traversal 20066 175
coroutines 21939 166
meteor_contest 28833 303
scimark_lu 29979 646
nqueens 34722 359
raytrace 35010 1228
scimark_monte_carlo 40481 358
float 44642 280
generators 52287 222
pyflate 57937 1328
scimark_fft 64011 820
nbody 64806 458
mdp 70366 1743
spectral_norm 79688 273
scimark_sor 100342 293
bpe_tokeniser 162137 3128
pprint_safe_repr 268750 416
fannkuch 271393 272
tomli_loads 286966 1611
pprint_pformat 360345 638
Results, sorted by absolute instruction count
name mean instruction count instruction count
pickle_dict 1 164
coroutines 21939 166
gc_traversal 20066 175
unpickle_list 1 185
pickle_list 1 192
create_gc_cycles 277 220
generators 52287 222
json_dumps 1853 244
pickle 1 260
unpickle 1 261
pidigits 3499 262
scimark_sparse_mat_mult 4978 271
fannkuch 271393 272
spectral_norm 79688 273
logging_silent 1 276
float 44642 280
telco 1255 285
scimark_sor 100342 293
json 304 294
json_loads 15 302
meteor_contest 28833 303
sqlite_synth 1 322
deepcopy_memo 27 333
scimark_monte_carlo 40481 358
nqueens 34722 359
deepcopy_reduce 1 384
pprint_safe_repr 268750 416
comprehensions 12 418
nbody 64806 458
deepcopy 105 558
typing_runtime_protocols 47 596
pprint_pformat 360345 638
scimark_lu 29979 646
chameleon 1892 684
scimark_fft 64011 820
django_template 5268 864
chaos 12955 888
logging_simple 8 921
richards 9794 927
logging_format 8 969
richards_super 10174 970
crypto_pyaes 10197 982
python_startup 1 1197
python_startup_no_site 1 1197
raytrace 35010 1228
pickle_pure_python 716 1305
pyflate 57937 1328
deltablue 504 1394
go 18550 1511
thrift 29 1546
tomli_loads 286966 1611
hexiom 800 1718
mdp 70366 1743
unpickle_pure_python 467 1869
regex_dna 33 2042
pathlib 438 2580
regex_effbot 51 3128
bpe_tokeniser 162137 3128
xml_etree_generate 2991 3171
xml_etree_parse 891 3230
mako 501 3412
xml_etree_iterparse 2096 3439
regex_compile 5028 3840
xml_etree_process 2186 3860
dulwich_log 696 4295
unpack_sequence 1 4960
asyncio_tcp 988 5015
bench_mp_pool 18 5551
bench_thread_pool 9 5776
coverage 483 6329
sqlalchemy_imperative 117 6826
genshi_text 570 7044
genshi_xml 1051 7436
aiohttp 2 10437
gunicorn 2 10441
flaskblogging 2 10443
pycparser 8694 12235
djangocms 3 12340
html5lib 550 12352
asyncio_tcp_ssl 473 12662
sympy_expand 2073 15964
regex_v8 55 15982
tornado_http 245 18026
async_generators 1388 19193
sympy_str 1105 19761
asyncio_websockets 16 20083
sqlalchemy_declarative 245 24749
sympy_integrate 100 27549
sympy_sum 290 52200
dask 390 53408
docutils 1994 64996
mypy2 3338 98633
pylint 551 106975

A few conclusions to draw from this:

There are a few benchmarks where the mean instruction count is 1, or very low, where there's not much an optimizer could do (short of multiple loops). Some of these, at least pickle*, unpickle*, and sqlite_synth, aren't really CPython interpreter benchmarks at all and just drop to C code pretty quickly (the profiling results confirm this). flaskblogging, gunicorn, and djangocms are at least ostensibly macrobenchmarks, so we should investigate why the mean instruction count is so low there. We should consider excluding this whole class of benchmarks from the global number, at least for the purposes of optimizing the interpreter.

Benchmarks with both a high mean instruction count and a high total instruction count feel like really robust examples of real-world code. These include pylint, mypy2, docutils, dask, sympy*, and pycparser.

@markshannon
Member

If the aim is to count the number of unique instructions executed per process, the event_handler should return sys.monitoring.DISABLE so that it sees each instruction only once; that way there is no need for a mapping.

    def event_handler(self, code, instruction_offset):
        # The sys.monitoring INSTRUCTION event handler
        self.count += 1
        return monitoring.DISABLE

That should allow subsequent iterations to run at close to full speed as well.

@markshannon
Member

markshannon commented Jun 28, 2024

If you need the total instruction count as well, then a stats build is probably the fastest.
Alternatively, use two different tool IDs: one to count the total unique instructions, as above, and another to count the total instructions executed:

    def event_handler2(self, code, instruction_offset):
        self.count_executed += 1
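A self-contained sketch of how the two counters combine to give the mean. The event delivery is simulated here rather than driven by sys.monitoring (so the example runs anywhere); the class and variable names are illustrative, not the actual implementation:

```python
class InstructionCount:
    # Sketch of the two-tool-ID idea: one handler counts unique
    # instructions and disables further events for that location,
    # the other counts every execution.
    DISABLE = object()  # stands in for sys.monitoring.DISABLE

    def __init__(self):
        self.count_unique = 0
        self.count_executed = 0

    def unique_handler(self, code, instruction_offset):
        # Fires once per specific instruction, then disables itself.
        self.count_unique += 1
        return self.DISABLE

    def executed_handler(self, code, instruction_offset):
        # Fires on every instruction execution.
        self.count_executed += 1


# Simulate a loop executing the same two instructions five times.
ic = InstructionCount()
seen = set()
for _ in range(5):
    for offset in (0, 2):
        if ("code", offset) not in seen:  # sys.monitoring stops delivering
            seen.add(("code", offset))    # events after DISABLE is returned
            ic.unique_handler("code", offset)
        ic.executed_handler("code", offset)

print(ic.count_executed // ic.count_unique)  # mean instruction count: 5
```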

@mdboom
Contributor Author

mdboom commented Jul 1, 2024

I see now that my confusion was over how monitoring.DISABLE works, and that your suggestion is just to improve the performance of the measurement, not to fix any problem with the definition of the metric itself.

Given that, any comments on the results?

@markshannon
Member

Just to check that I'm reading this right:
The numbers for spectral_norm, 79688 and 273, mean that only 273 unique instructions are executed, but each of those is executed 80k times, for a total of over 20 million bytecodes executed?

It is indeed puzzling that the numbers for mean instruction count are so low for so many benchmarks.

Before investigating further, it might be worth checking that the numbers stay the same using the faster approach I outlined, as a sanity check.

@mdboom
Contributor Author

mdboom commented Jul 8, 2024

The numbers for spectral_norm, 79688 and 273, mean that only 273 unique instructions are executed, but each of those is executed 80k times, for a total of over 20 million bytecodes executed?

Each was executed an average of 80k times, so some probably only once, and some much more than that. Collecting a histogram of execution counts could be interesting (but would require more data collected, of course).
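Such a histogram could be bucketed by order of magnitude from the per-instruction counts. A sketch with hypothetical data (real keys would be (code object, offset) pairs from the INSTRUCTION callback):

```python
from collections import Counter

# Hypothetical per-instruction execution counts.
instructions = {("f", 0): 1, ("f", 2): 80000, ("f", 4): 80000, ("g", 0): 3}

# Histogram of execution counts, bucketed by number of decimal digits
# (i.e. roughly order of magnitude).
histogram = Counter(len(str(count)) for count in instructions.values())
print(sorted(histogram.items()))  # [(1, 2), (5, 2)]
```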

Yeah, I'll try again with the faster approach and see how the numbers compare.

@mdboom
Contributor Author

mdboom commented Jul 8, 2024

Some additional findings (from discussion with @markshannon):

Some of these benchmarks start a web server in a separate process, and then just make HTTP requests against it from the main process. (flaskblogging, djangocms, aiohttp, others). Here, the total instruction count and the mean will only capture the main process, which is mostly just waiting on IO, not running Python code. Changes could be made to capture that "real" work if it's useful. This may also be affecting our pystats collection -- I will confirm one way or the other.

Likewise, python_startup_* just times a subprocess, and that subprocess doesn't collect any of this data.

@mdboom
Contributor Author

mdboom commented Jul 9, 2024

Using @markshannon's suggested performance improvements, the results are basically the same.

Results, sorted by mean instruction count
name mean instruction count instruction count
deepcopy_reduce 1 366
logging_silent 1 258
pickle_dict 1 146
pickle_list 1 174
python_startup 1 1179
python_startup_no_site 1 1179
sqlite_synth 1 304
unpack_sequence 1 4942
unpickle 1 243
unpickle_list 1 167
aiohttp 2 10419
flaskblogging 2 10425
gunicorn 2 10423
pickle 2 242
djangocms 3 12322
logging_format 8 951
logging_simple 8 903
bench_thread_pool 9 5725
comprehensions 13 400
asyncio_websockets 16 20065
json_loads 16 284
deepcopy_memo 28 315
thrift 29 1528
regex_dna 33 2024
typing_runtime_protocols 47 578
sqlglot_transpile 49 4515
regex_effbot 52 3110
regex_v8 55 15964
sqlglot_parse 71 2794
bench_mp_pool 91 5630
sympy_integrate 101 27531
deepcopy 108 540
sqlalchemy_imperative 117 6808
sqlalchemy_declarative 245 24766
tornado_http 245 18041
sympy_sum 285 53368
create_gc_cycles 301 202
json 324 276
dask 333 52986
pathlib 441 2562
unpickle_pure_python 472 1851
asyncio_tcp_ssl 473 12651
coverage 484 6311
mako 503 3395
deltablue 511 1376
html5lib 551 12334
pylint 551 106961
genshi_text 571 7026
sqlglot_optimize 597 8617
dulwich_log 699 4277
pickle_pure_python 726 1287
hexiom 809 1700
xml_etree_parse 896 3212
asyncio_tcp 992 4997
genshi_xml 1054 7418
sympy_str 1098 19743
mypy2 1119 77867
telco 1340 267
async_generators 1390 19175
chameleon 1943 666
docutils 1993 65031
json_dumps 2001 226
sympy_expand 2079 15946
xml_etree_iterparse 2107 3421
xml_etree_process 2196 3842
xml_etree_generate 3008 3153
sqlglot_normalize 3392 3286
pidigits 3757 244
async_tree_eager_tg 4369 1944
regex_compile 5052 3822
scimark_sparse_mat_mult 5333 253
django_template 5380 846
async_tree_eager_cpu_io_mixed_tg 5754 2370
async_tree_eager 6441 1969
async_tree_eager_memoization_tg 7286 2348
async_tree_eager_cpu_io_mixed 7479 2348
pycparser 8692 12237
async_tree_eager_memoization 8974 2326
richards 9988 909
richards_super 10366 952
crypto_pyaes 10388 964
async_tree_cpu_io_mixed_tg 11849 2275
async_tree_none_tg 11978 1940
async_tree_memoization_tg 13104 2253
chaos 13223 870
async_tree_cpu_io_mixed 13239 2244
async_tree_none 13657 1903
async_tree_memoization 14526 2222
go 18773 1493
async_tree_eager_io_tg 20297 2248
async_tree_eager_io 21793 2220
gc_traversal 22367 157
async_tree_io_tg 23148 2184
coroutines 24607 148
async_tree_io 24760 2153
meteor_contest 30654 285
scimark_lu 30838 628
raytrace 35531 1210
nqueens 36555 341
scimark_monte_carlo 42624 340
float 47710 262
generators 56900 204
pyflate 58733 1310
scimark_fft 65448 802
nbody 67458 440
mdp 71100 1725
spectral_norm 85313 255
scimark_sor 106910 275
bpe_tokeniser 163075 3110
pprint_safe_repr 280905 398
tomli_loads 290209 1593
fannkuch 290626 254
pprint_pformat 370807 620
Results, sorted by instruction count
name mean instruction count instruction count
pickle_dict 1 146
coroutines 24607 148
gc_traversal 22367 157
unpickle_list 1 167
pickle_list 1 174
create_gc_cycles 301 202
generators 56900 204
json_dumps 2001 226
pickle 2 242
unpickle 1 243
pidigits 3757 244
scimark_sparse_mat_mult 5333 253
fannkuch 290626 254
spectral_norm 85313 255
logging_silent 1 258
float 47710 262
telco 1340 267
scimark_sor 106910 275
json 324 276
json_loads 16 284
meteor_contest 30654 285
sqlite_synth 1 304
deepcopy_memo 28 315
scimark_monte_carlo 42624 340
nqueens 36555 341
deepcopy_reduce 1 366
pprint_safe_repr 280905 398
comprehensions 13 400
nbody 67458 440
deepcopy 108 540
typing_runtime_protocols 47 578
pprint_pformat 370807 620
scimark_lu 30838 628
chameleon 1943 666
scimark_fft 65448 802
django_template 5380 846
chaos 13223 870
logging_simple 8 903
richards 9988 909
logging_format 8 951
richards_super 10366 952
crypto_pyaes 10388 964
python_startup 1 1179
python_startup_no_site 1 1179
raytrace 35531 1210
pickle_pure_python 726 1287
pyflate 58733 1310
deltablue 511 1376
go 18773 1493
thrift 29 1528
tomli_loads 290209 1593
hexiom 809 1700
mdp 71100 1725
unpickle_pure_python 472 1851
async_tree_none 13657 1903
async_tree_none_tg 11978 1940
async_tree_eager_tg 4369 1944
async_tree_eager 6441 1969
regex_dna 33 2024
async_tree_io 24760 2153
async_tree_io_tg 23148 2184
async_tree_eager_io 21793 2220
async_tree_memoization 14526 2222
async_tree_cpu_io_mixed 13239 2244
async_tree_eager_io_tg 20297 2248
async_tree_memoization_tg 13104 2253
async_tree_cpu_io_mixed_tg 11849 2275
async_tree_eager_memoization 8974 2326
async_tree_eager_memoization_tg 7286 2348
async_tree_eager_cpu_io_mixed 7479 2348
async_tree_eager_cpu_io_mixed_tg 5754 2370
pathlib 441 2562
sqlglot_parse 71 2794
regex_effbot 52 3110
bpe_tokeniser 163075 3110
xml_etree_generate 3008 3153
xml_etree_parse 896 3212
sqlglot_normalize 3392 3286
mako 503 3395
xml_etree_iterparse 2107 3421
regex_compile 5052 3822
xml_etree_process 2196 3842
dulwich_log 699 4277
sqlglot_transpile 49 4515
unpack_sequence 1 4942
asyncio_tcp 992 4997
bench_mp_pool 91 5630
bench_thread_pool 9 5725
coverage 484 6311
sqlalchemy_imperative 117 6808
genshi_text 571 7026
genshi_xml 1054 7418
sqlglot_optimize 597 8617
aiohttp 2 10419
gunicorn 2 10423
flaskblogging 2 10425
pycparser 8692 12237
djangocms 3 12322
html5lib 551 12334
asyncio_tcp_ssl 473 12651
sympy_expand 2079 15946
regex_v8 55 15964
tornado_http 245 18041
async_generators 1390 19175
sympy_str 1098 19743
asyncio_websockets 16 20065
sqlalchemy_declarative 245 24766
sympy_integrate 101 27531
dask 333 52986
sympy_sum 285 53368
docutils 1993 65031
mypy2 1119 77867
pylint 551 106961

@cmangla-overleaf

cmangla-overleaf commented Jul 12, 2024

The metric is:

1. count the number of times each specific instruction is executed

2. sum all of these numbers up and divide by the total number of specific instructions that were executed at least once

This, roughly speaking, gives an idea of how "loopy" code is.

I suggest another summary statistic. Take the counts in step 1 and calculate their Shannon entropy:

H(X) = -∑ₓ p(x) log p(x)

Sort the benchmarks by this value. More loopy code should be less "surprising", which is what the entropy should reflect.

For each benchmark, p(x) is the probability of instruction x, i.e., the count of instruction x in the benchmark divided by the total instruction count of the benchmark. So for each benchmark, the p(x) should sum to 1.
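A sketch of this statistic, with two made-up benchmarks: "loopy" concentrates executions on one instruction, "straight" executes each instruction exactly once (so its distribution is uniform and its entropy is maximal, log₂ 4 = 2 bits):

```python
import math

def entropy(counts):
    # Shannon entropy (bits) of the instruction-count distribution:
    # p(x) = count(x) / total, H = -sum of p(x) * log2(p(x))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

loopy = {"a": 97, "b": 1, "c": 1, "d": 1}     # work concentrated in a loop
straight = {"a": 1, "b": 1, "c": 1, "d": 1}   # straight-line code

print(entropy(loopy) < entropy(straight))  # True: loopy is less "surprising"
```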

@cmangla

cmangla commented Jul 21, 2024

@mdboom Could you post the raw data here? I am just curious because of my general interest in this sort of analysis.
