
📊 Improve spike delivery #2222

Merged

Conversation

thorstenhater
Contributor

@thorstenhater thorstenhater commented Sep 15, 2023

Introduction

Issue #502 claims that under certain circumstances, spike walking can become a bottleneck,
with L2 cache misses posited as the root cause. The setup required to reproduce this is
fairly extreme, though.

Baseline

Selected to show the problem while being feasible to run on a laptop.

Input

{
    "name": "test",
    "num-cells": 256,
    "duration": 400,
    "min-delay": 10,
    "fan-in": 10000,
    "realtime-ratio": 0.01,
    "spike-frequency": 40,
    "threads": 4,
    "ranks": 8000
}

Timing

❯ hyperfine 'bin/drybench ~/src/arbor/example/drybench/params.json'
Benchmark 1: bin/drybench ~/src/arbor/example/drybench/params.json
  Time (mean ± σ):      4.334 s ±  0.146 s    [User: 6.026 s, System: 0.418 s]
  Range (min … max):    4.148 s …  4.667 s    10 runs

Changes

  • Store the connection list as a structure of arrays instead of an array of structures. This reduces cache misses during the binary search for the correct source.
  • Use lower_bound instead of equal_range. This removes one binary search, which is cache-unfriendly.
  • Treat all spikes from the same source in one go. This keeps the relevant values in cache instead of discarding and re-acquiring them.
  • Swap the member order in spike_event, reducing its size from 24 to 16 bytes.
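The first, second, and fourth changes can be sketched roughly as below. Type and field names here are hypothetical, chosen for illustration rather than taken from Arbor's actual sources, and the `auto&&` parameter assumes C++20:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using cell_gid = std::uint32_t;
using lid      = std::uint32_t;

// Before: array of structures. A binary search over `source` strides across
// unrelated fields (target, weight, delay), wasting most of each cache line.
struct connection {
    cell_gid source;
    lid      target;
    float    weight;
    float    delay;
};

// After: structure of arrays. The search touches only the densely packed
// `sources` array; the other arrays are read just for the matching range.
struct connection_list {
    std::vector<cell_gid> sources;  // sorted by source gid
    std::vector<lid>      targets;
    std::vector<float>    weights;
    std::vector<float>    delays;

    // One lower_bound instead of equal_range: the second binary search is
    // replaced by a linear scan over the (typically short) run of equal sources.
    void walk_spike(cell_gid spike_source, double spike_time, auto&& emit) const {
        auto beg = sources.begin();
        auto it  = std::lower_bound(beg, sources.end(), spike_source);
        for (; it != sources.end() && *it == spike_source; ++it) {
            auto ix = it - beg;
            emit(targets[ix], weights[ix], spike_time + delays[ix]);
        }
    }
};

// Member order matters for padding: (double, float, uint32) packs into 16 bytes,
// whereas (uint32, double, float) would be padded out to 24.
struct spike_event {
    double time;
    float  weight;
    lid    target;
};
static_assert(sizeof(spike_event) == 16);
```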

Outcome

We get a minor reduction in runtime on my local machine:

❯ hyperfine 'bin/drybench ../example/drybench/params.json'
Benchmark 1: bin/drybench ../example/drybench/params.json
  Time (mean ± σ):      4.225 s ±  0.167 s    [User: 5.939 s, System: 0.397 s]
  Range (min … max):    4.064 s …  4.632 s    10 runs

4.064s vs 4.148s, but this is still within one $\sigma$.

Routes not taken

Using a faster sort

Tried pdqsort, but sorting isn't the real bottleneck, and pdqsort doesn't measurably
improve on our workload anyway.

Using a faster lower_bound

As with sort, no variation of lower_bound improves matters measurably.
We can conclude that enqueuing is bound by the actual pushing of
events into the cells' queues. As noted in #502, L2 cache misses are the probable
root cause.

Building a temporary buffer of events

Tried creating a scratch space to dump all events into, keyed by their queue index;
the scratch buffer is kept around to avoid repeated allocations.
We then sort it by index and build the queues in one go. This proved a significant
slow-down.
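For reference, the rejected scheme looked roughly like this (a sketch with made-up names, not the actual patch):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct event { double time; float weight; };

// Rejected approach (sketch): dump every event into one scratch vector keyed by
// destination queue index, sort once by that key, then append the sorted runs
// to the per-cell queues. The extra sort and copies made this a net slow-down.
inline void build_queues(std::vector<std::pair<std::size_t, event>>& scratch,
                         std::vector<std::vector<event>>& queues) {
    std::stable_sort(scratch.begin(), scratch.end(),
                     [](const auto& a, const auto& b) { return a.first < b.first; });
    for (const auto& [ix, ev]: scratch) queues[ix].push_back(ev);
    scratch.clear();  // keep the capacity so the next epoch allocates nothing
}
```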

SoA-Splitting Spikes

Doesn't improve our main bottleneck, and spike is already pretty small (i.e. not much
waste per cache line).

Multithreading the appends

That would only worsen the cache misses, as L2 is shared between threads.
Also, if cache misses are our problem, this doesn't address the root cause.

Adding a pre-filter to spike processing

Reducing the incoming spikes by tracking all sources terminating at the local process
didn't yield an improvement, even though it rejected ~25% of all incoming events. Instead,
a significant slow-down was observed.
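The rejected pre-filter can be sketched as follows (hypothetical names; not Arbor's actual code). Building and probing the set of local sources cost more than the ~25% of walks it saved:

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

using cell_gid = std::uint32_t;

struct spike { cell_gid source; double time; };

// Rejected approach (sketch): drop spikes whose source has no target on this
// rank before walking the connection table at all.
inline std::vector<spike> prefilter(const std::vector<spike>& incoming,
                                    const std::unordered_set<cell_gid>& local_sources) {
    std::vector<spike> kept;
    kept.reserve(incoming.size());
    for (const auto& s: incoming) {
        if (local_sources.count(s.source)) kept.push_back(s);
    }
    return kept;
}
```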

Juwels Booster

Input Deck and Configuration

  • JuwelsBooster develbooster queue
  • CMake
cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DARB_USE_BUNDLED_LIBS=ON -DARB_VECTORIZE=ON -DARB_WITH_PYTHON=OFF -DARB_WITH_MPI=ON -DBUILD_SHARED_LIBS=OFF -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON -DARB_PROFILING=ON -DCMAKE_INSTALL_PREFIX=../install -G Ninja -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DARB_GPU=cuda -DCMAKE_BUILD_TYPE=release
    
  • Input
    {
      "name": "test",
      "num-cells": 8000,
      "duration": 200,
      "min-delay": 10,
      "fan-in": 5000,
      "realtime-ratio": 0.01,
      "spike-frequency": 50,
      "threads": 4,
      "ranks": 10000
    }

Validation

benchmark parameters:
  name:           test
  cells per rank: 8000
  duration:       200 ms
  fan in:         5000 connections/cell
  min delay:      10 ms
  spike freq:     50 Hz
  cell overhead:  0.01 ms to advance 1 ms
expected:
  cell advance:   16 s
  spikes:         800000000
  events:         3204710400
  spikes:         2000 per interval
  events:         151969 per cell per interval
HW resources:
  threads:        4
  ranks:          10000

and

808110000 spikes generated at rate of 10 spikes per cell

Baseline

Summary

---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
model-init                    113.175         288.951
model-run                     107.741        1306.015
meter-total                   220.916        1594.966

Profiler

REGION                          CALLS      WALL     THREAD        %
root                                -    78.541    314.163    100.0
  communication                     -    36.991    147.964     47.1
    enqueue                         -    21.247     84.986     27.1
      sort                     320000    19.952     79.807     25.4
      merge                    320000     1.174      4.695      1.5
      setup                    320000     0.121      0.483      0.2
    walkspikes                     40    15.735     62.940     20.0
    exchange                        -     0.010      0.038      0.0
      sort                         40     0.006      0.022      0.0
      gatherlocal                  40     0.004      0.016      0.0
      gather                       40     0.000      0.000      0.0
        remote                     40     0.000      0.000      0.0
          post_process             40     0.000      0.000      0.0

Perf

 Performance counter stats for 'bin/drybench ../example/drybench/params.json':

         245572.30 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
          29077724      page-faults:u             #  118.408 K/sec
      730063012768      cycles:u                  #    2.973 GHz                      (83.33%)
       25160708320      stalled-cycles-frontend:u #    3.45% frontend cycles idle     (83.33%)
      332882140632      stalled-cycles-backend:u  #   45.60% backend cycles idle      (83.33%)
     1080323993228      instructions:u            #    1.48  insn per cycle
                                                  #    0.31  stalled cycles per insn  (83.33%)
      243594581289      branches:u                #  991.946 M/sec                    (83.33%)
        3447791949      branch-misses:u           #    1.42% of all branches          (83.33%)

     245.609154303 seconds time elapsed

     218.071863000 seconds user
      25.114123000 seconds sys

Feature Branch

Summary

---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
model-init                    112.901         939.580
model-run                      84.730         871.134
meter-total                   197.631        1810.714

Profiler

REGION                              CALLS      WALL     THREAD        %
root                                    -    71.717    286.869    100.0
  communication                         -    30.408    121.633     42.4
    enqueue                             -    20.006     80.023     27.9
      sort                         320000    18.984     75.938     26.5
      merge                        320000     0.937      3.746      1.3
      setup                        320000     0.085      0.340      0.1
    walkspikes                         40    10.401     41.605     14.5
    exchange                            -     0.001      0.005      0.0
      sort                             40     0.001      0.004      0.0
      gatherlocal                      40     0.000      0.001      0.0
      gather                           40     0.000      0.000      0.0
        remote                         40     0.000      0.000      0.0
          post_process                 40     0.000      0.000      0.0
    spikeio                            40     0.000      0.000      0.0

Perf

 Performance counter stats for 'bin/drybench /p/project/cslns/hater1/arbor/example/drybench/params.json':

         221852.22 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
          28014394      page-faults:u             #  126.275 K/sec
      658832257282      cycles:u                  #    2.970 GHz                      (83.33%)
       49927962696      stalled-cycles-frontend:u #    7.58% frontend cycles idle     (83.33%)
      285682904743      stalled-cycles-backend:u  #   43.36% backend cycles idle      (83.33%)
      943000553536      instructions:u            #    1.43  insn per cycle
                                                  #    0.30  stalled cycles per insn  (83.33%)
      212435336483      branches:u                #  957.553 M/sec                    (83.33%)
        3372189010      branch-misses:u           #    1.59% of all branches          (83.33%)

     221.907422663 seconds time elapsed

     195.279197000 seconds user
      24.508514000 seconds sys

Sorting events by time only

w/ pdqsort

---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
model-init                    111.256         939.580
model-run                      79.065         871.150
meter-total                   190.321        1810.730

with util::sort_by

---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
model-init                    111.666         939.581
model-run                      78.900         871.131
meter-total                   190.565        1810.712

Profiler

REGION                              CALLS      WALL     THREAD        %
root                                    -    68.611    274.442    100.0
  communication                         -    27.576    110.306     40.2
    enqueue                             -    17.728     70.912     25.8
      sort                         320000    16.853     67.410     24.6
      merge                        320000     0.808      3.231      1.2
      setup                        320000     0.068      0.270      0.1
    walkspikes                         40     9.848     39.391     14.4

Conclusion

We find a 30% decrease in time spent on spike walking and a 10% decrease end-to-end. Note that
this case is extremely heavy on the communication part and advances cells at 100x realtime, i.e.
1 ms wallclock for 100 ms of biological time, which is far beyond any cable cell we encounter in the wild.

By sorting on the time field alone -- we don't care about the ordering beyond that -- we gain
another 5%.
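The time-only sort amounts to the following (field names are hypothetical; Arbor's actual comparator may differ):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct spike_event {
    double        time;
    float         weight;
    std::uint32_t target;
};

// A full lexicographic comparator would compare time, then target, then weight.
// For correct delivery only the time ordering matters, so sorting on the single
// `time` key does less work per comparison.
inline void sort_by_time(std::vector<spike_event>& evts) {
    std::sort(evts.begin(), evts.end(),
              [](const spike_event& a, const spike_event& b) { return a.time < b.time; });
}
```

Note that dropping the secondary keys makes the order of same-time events unspecified, which is why reproducibility of the event ordering came up in review.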

Side note: Memory measurement

The memory measurement isn't trustworthy: the 290 MB reported above for the baseline turned into
1300 MB on repeating the benchmark.

Related Issues

Closes #502

@thorstenhater
Contributor Author

Spack fails on matplotlib deprecation style=seaborn

@boeschf We get these spurious (?) failures on daint, is there a way to re-run the jobs?

bcumming
bcumming previously approved these changes Oct 18, 2023
Member

@bcumming bcumming left a comment


This performance issue was always something that I thought would be "nice to have" - it raises its head when you are simulating enormous models on large distributed systems, which was where we wanted to be!

I have one comment about reproducibility and event sorting (I hope that this doesn't impact event performance.)

arbor/simulation.cpp Outdated Show resolved Hide resolved
Member

@bcumming bcumming left a comment


lgtm!

@thorstenhater thorstenhater merged commit 29d84bd into arbor-sim:master Nov 21, 2023
19 checks passed

Successfully merging this pull request may close these issues.

Optimise spike walking