# Introduction

Issue #502 claims that under certain circumstances spike walking can become a bottleneck, with L2 cache misses posited as the root cause. The setup required to provoke this is pretty extreme, though.

# Baseline

Selected to show the problem while remaining feasible to run on a laptop.

Input
```json
{
    "name": "test",
    "num-cells": 256,
    "duration": 400,
    "min-delay": 10,
    "fan-in": 10000,
    "realtime-ratio": 0.01,
    "spike-frequency": 40,
    "threads": 4,
    "ranks": 8000
}
```

Timing
```
❯ hyperfine 'bin/drybench ~/src/arbor/example/drybench/params.json'
Benchmark 1: bin/drybench ~/src/arbor/example/drybench/params.json
  Time (mean ± σ):      4.334 s ±  0.146 s    [User: 6.026 s, System: 0.418 s]
  Range (min … max):    4.148 s …  4.667 s    10 runs
```

# Changes

- Store the connection list as a structure of arrays instead of an array of structures. This reduces cache misses during the binary search for the correct source (see the sketch after this list).
- Use `lower_bound` instead of `equal_range`. This removes one binary search, which is cache-unfriendly.
- Treat all spikes from the same source in one go. This keeps all relevant values around instead of discarding and re-acquiring them.
- Swap the member order in `spike_event`, which reduces its size from 24 to 16 bytes.
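For illustration, here is a rough sketch combining the layout change, the `lower_bound` walk, and the member reorder. Type and member names (`connection_table`, `spike`, `spike_event`, the gid/index aliases) are stand-ins for this write-up, not Arbor's actual definitions, and the enqueue loop is simplified accordingly.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical aliases; Arbor's real types differ in detail.
using cell_gid_type  = std::uint32_t;
using cell_size_type = std::uint32_t;

struct spike { cell_gid_type source; double time; };

// Structure-of-arrays connection table, sorted by source gid. A binary search
// now walks only the densely packed `srcs` array instead of whole connection
// records, so far fewer cache lines are touched per probe.
struct connection_table {
    std::vector<cell_gid_type>  srcs;     // sorted by source gid
    std::vector<cell_size_type> tgts;     // index of the target cell's event queue
    std::vector<float>          weights;
    std::vector<float>          delays;
};

// Reordered event: putting the 8-byte time first avoids alignment padding.
// {target, time, weight} pads out to 4 + 4 + 8 + 4 + 4 = 24 bytes on a common
// 64-bit ABI; {time, target, weight} packs into 8 + 4 + 4 = 16 bytes.
struct spike_event { double time; cell_size_type target; float weight; };
static_assert(sizeof(spike_event) == 16, "holds on common 64-bit ABIs");

// Enqueue a batch of spikes (sorted by source gid) against the connection table.
// One lower_bound per distinct source replaces an equal_range per spike, and all
// spikes of that source are handled while the matching connections are hot.
void enqueue(const std::vector<spike>& spikes, const connection_table& tab,
             std::vector<std::vector<spike_event>>& queues) {
    for (auto it = spikes.begin(); it != spikes.end();) {
        const auto src = it->source;
        // Contiguous run of spikes with the same source.
        const auto run_end = std::find_if(it, spikes.end(),
                                          [src](const spike& s) { return s.source != src; });
        // Single binary search for the first connection of this source.
        auto c = std::lower_bound(tab.srcs.begin(), tab.srcs.end(), src) - tab.srcs.begin();
        for (; c < static_cast<std::ptrdiff_t>(tab.srcs.size()) && tab.srcs[c] == src; ++c) {
            for (auto s = it; s != run_end; ++s) {
                queues[tab.tgts[c]].push_back({s->time + tab.delays[c], tab.tgts[c], tab.weights[c]});
            }
        }
        it = run_end;
    }
}
```

Because both the spike batch and `srcs` are sorted by source, one `lower_bound` per distinct source suffices to find the start of the matching connections; the previous `equal_range` paid for a second binary search per spike just to find the end of the range.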
# Outcome

We get a minor reduction in runtime on my local machine:
```
❯ hyperfine 'bin/drybench ../example/drybench/params.json'
Benchmark 1: bin/drybench ../example/drybench/params.json
  Time (mean ± σ):      4.225 s ±  0.167 s    [User: 5.939 s, System: 0.397 s]
  Range (min … max):    4.064 s …  4.632 s    10 runs
```
4.064 s vs 4.148 s, but this is still within one $\sigma$.

# Routes not taken

## Using a faster `sort`

Tried `pdqsort`, but sorting isn't a real bottleneck and/or `pdq` doesn't improve on our problem.

## Using a faster `lower_bound`

As with `sort`, no variation of `lower_bound` improves things measurably. We can conclude that enqueuing is bound by the actual pushing of events into the cells' queues. As noted in #502, L2 cache misses are a probable root cause.

## Building a temporary buffer of events

Tried creating a scratch space to dump all events into, keyed by their queue index; to avoid allocations, the scratch buffer is reused. We then sort it by index and build the queues in one go (sketched below). This proved to be a significant slow-down.
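Roughly, the rejected approach looks like the following sketch; the names and the exact event type are illustrative, not the code that was actually tried.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct event { double time; float weight; };

// Sketch of the rejected route: collect (queue index, event) pairs into a
// reusable scratch buffer, sort by index, then append each run to its queue
// in one go.
void enqueue_via_scratch(const std::vector<std::pair<std::size_t, event>>& generated,
                         std::vector<std::pair<std::size_t, event>>& scratch,   // reused between calls
                         std::vector<std::vector<event>>& queues) {
    scratch.clear();                                   // keeps capacity, avoids reallocation
    scratch.insert(scratch.end(), generated.begin(), generated.end());
    std::sort(scratch.begin(), scratch.end(),
              [](const auto& a, const auto& b) { return a.first < b.first; });
    for (std::size_t i = 0; i < scratch.size();) {
        const auto q = scratch[i].first;
        auto j = i;
        while (j < scratch.size() && scratch[j].first == q) ++j;   // run for one queue
        auto& dst = queues[q];
        for (auto k = i; k < j; ++k) dst.push_back(scratch[k].second);
        i = j;
    }
}
```

In our measurements the extra sort and copy apparently cost more than the more cache-friendly, batched appends saved.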
## SoA-Splitting Spikes

This doesn't improve our main bottleneck, and `spike` is already pretty small, i.e. there is not much waste on a cache line.

## Multithreading the appends

That would only increase the number of cache misses, since L2 is shared between threads. Also, if cache misses are our problem, this doesn't address the root cause.

## Adding a pre-filter to spike processing

Reducing the incoming spikes by tracking all sources that terminate at the local process rejected ~25% of all incoming events but didn't yield an improvement; instead, a significant slow-down was observed.

# Juwels Booster

## Input Deck and Configuration

- Juwels Booster, `develbooster` queue
- CMake

  ```
  cmake .. \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
    -DARB_USE_BUNDLED_LIBS=ON \
    -DARB_VECTORIZE=ON \
    -DARB_WITH_PYTHON=OFF \
    -DARB_WITH_MPI=ON \
    -DBUILD_SHARED_LIBS=OFF \
    -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON \
    -DARB_PROFILIING=ON \
    -DCMAKE_INSTALL_PREFIX=../install \
    -G Ninja \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
    -DARB_GPU=cuda \
    -DCMAKE_BUILD_TYPE=release
  ```

- Input

  ```json
  {
      "name": "test",
      "num-cells": 8000,
      "duration": 200,
      "min-delay": 10,
      "fan-in": 5000,
      "realtime-ratio": 0.01,
      "spike-frequency": 50,
      "threads": 4,
      "ranks": 10000
  }
  ```

## Validation

```
benchmark parameters:
  name:           test
  cells per rank: 8000
  duration:       200 ms
  fan in:         5000 connections/cell
  min delay:      10 ms
  spike freq:     50 Hz
  cell overhead:  0.01 ms to advance 1 ms
expected:
  cell advance: 16 s
  spikes:       800000000
  events:       3204710400
  spikes:       2000 per interval
  events:       151969 per cell per interval
HW resources:
  threads: 4
  ranks:   10000
```

and

```
808110000 spikes generated at rate of 10 spikes per cell
```

## Baseline

### Summary

```
---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
model-init                    113.175         288.951
model-run                     107.741        1306.015
meter-total                   220.916        1594.966
```

### Profiler

```
REGION              CALLS    WALL   THREAD       %
root                    -  78.541  314.163   100.0
  communication         -  36.991  147.964    47.1
    enqueue             -  21.247   84.986    27.1
      sort         320000  19.952   79.807    25.4
      merge        320000   1.174    4.695     1.5
      setup        320000   0.121    0.483     0.2
    walkspikes         40  15.735   62.940    20.0
    exchange            -   0.010    0.038     0.0
      sort             40   0.006    0.022     0.0
      gatherlocal      40   0.004    0.016     0.0
      gather           40   0.000    0.000     0.0
      remote           40   0.000    0.000     0.0
      post_process     40   0.000    0.000     0.0
```

### Perf

```
 Performance counter stats for 'bin/drybench ../example/drybench/params.json':

         245572.30 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
          29077724      page-faults:u             #  118.408 K/sec
      730063012768      cycles:u                  #    2.973 GHz                      (83.33%)
       25160708320      stalled-cycles-frontend:u #    3.45% frontend cycles idle     (83.33%)
      332882140632      stalled-cycles-backend:u  #   45.60% backend cycles idle      (83.33%)
     1080323993228      instructions:u            #    1.48  insn per cycle
                                                  #    0.31  stalled cycles per insn  (83.33%)
      243594581289      branches:u                #  991.946 M/sec                    (83.33%)
        3447791949      branch-misses:u           #    1.42% of all branches          (83.33%)

     245.609154303 seconds time elapsed

     218.071863000 seconds user
      25.114123000 seconds sys
```

## Feature Branch

### Summary

```
---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
model-init                    112.901         939.580
model-run                      84.730         871.134
meter-total                   197.631        1810.714
```

### Profiler

```
REGION              CALLS    WALL   THREAD       %
root                    -  71.717  286.869   100.0
  communication         -  30.408  121.633    42.4
    enqueue             -  20.006   80.023    27.9
      sort         320000  18.984   75.938    26.5
      merge        320000   0.937    3.746     1.3
      setup        320000   0.085    0.340     0.1
    walkspikes         40  10.401   41.605    14.5
    exchange            -   0.001    0.005     0.0
      sort             40   0.001    0.004     0.0
      gatherlocal      40   0.000    0.001     0.0
      gather           40   0.000    0.000     0.0
      remote           40   0.000    0.000     0.0
      post_process     40   0.000    0.000     0.0
    spikeio            40   0.000    0.000     0.0
```

### Perf

```
 Performance counter stats for 'bin/drybench /p/project/cslns/hater1/arbor/example/drybench/params.json':

         221852.22 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
          28014394      page-faults:u             #  126.275 K/sec
      658832257282      cycles:u                  #    2.970 GHz                      (83.33%)
       49927962696      stalled-cycles-frontend:u #    7.58% frontend cycles idle     (83.33%)
      285682904743      stalled-cycles-backend:u  #   43.36% backend cycles idle      (83.33%)
      943000553536      instructions:u            #    1.43  insn per cycle
                                                  #    0.30  stalled cycles per insn  (83.33%)
      212435336483      branches:u                #  957.553 M/sec                    (83.33%)
        3372189010      branch-misses:u           #    1.59% of all branches          (83.33%)

     221.907422663 seconds time elapsed

     195.279197000 seconds user
      24.508514000 seconds sys
```

## Sorting events by `time` only w/ pdqsort

```
---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
model-init                    111.256         939.580
model-run                      79.065         871.150
meter-total                   190.321        1810.730
```

with `util::sort_by`

```
---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
model-init                    111.666         939.581
model-run                      78.900         871.131
meter-total                   190.565        1810.712
```

### Profiler

```
REGION              CALLS    WALL   THREAD       %
root                    -  68.611  274.442   100.0
  communication         -  27.576  110.306    40.2
    enqueue             -  17.728   70.912    25.8
      sort         320000  16.853   67.410    24.6
      merge        320000   0.808    3.231     1.2
      setup        320000   0.068    0.270     0.1
    walkspikes         40   9.848   39.391    14.4
```
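The relaxed ordering measured above amounts to sorting the pending events on delivery time alone. A minimal sketch, reusing the hypothetical `spike_event` stand-in from earlier and `std::sort` in place of `pdqsort`/`util::sort_by`:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct spike_event { double time; std::uint32_t target; float weight; };

// Order pending events by delivery time only; ties may land in any order,
// since nothing downstream depends on a secondary key. A pattern-defeating
// quicksort can be dropped in for std::sort without changing the interface.
void sort_by_time(std::vector<spike_event>& events) {
    std::sort(events.begin(), events.end(),
              [](const spike_event& a, const spike_event& b) { return a.time < b.time; });
}
```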
# Conclusion

We find a 30% decrease in time spent on spike walking and a 10% decrease end-to-end. Note that this case is extremely heavy on the communication part and advances cells at 100x _realtime_, i.e. 1 ms of wallclock per 100 ms of biological time, which is far beyond any cable cell we encounter in the wild. By just sorting on the `time` field -- we don't care about the ordering beyond that -- we find another 5%.

## Side note: Memory measurement

The memory measurement isn't trustworthy; the 290 MB reported above for the baseline turned into 1300 MB on repeating the benchmark.

# Related Issues

Closes #502