
📊 Improve spike delivery #2222

Merged

Conversation

thorstenhater
Contributor

@thorstenhater thorstenhater commented Sep 15, 2023

Introduction

Issue #502 claims that under certain circumstances, spike walking can become a bottleneck,
with L2 cache misses posited as the root cause. The setup required to reproduce this is
fairly extreme, though.

Baseline

Selected to show the problem while being feasible to run on a laptop.

Input

{
    "name": "test",
    "num-cells": 256,
    "duration": 400,
    "min-delay": 10,
    "fan-in": 10000,
    "realtime-ratio": 0.01,
    "spike-frequency": 40,
    "threads": 4,
    "ranks": 8000
}

Timing

❯ hyperfine 'bin/drybench ~/src/arbor/example/drybench/params.json'
Benchmark 1: bin/drybench ~/src/arbor/example/drybench/params.json
  Time (mean ± σ):      4.334 s ±  0.146 s    [User: 6.026 s, System: 0.418 s]
  Range (min … max):    4.148 s …  4.667 s    10 runs

Changes

  • Store the connection list as a structure of arrays instead of an array of structures. This reduces cache misses during the binary search for the correct source.
  • Use lower_bound instead of equal_range. This removes one binary search, which is cache-unfriendly.
  • Treat all spikes from the same source in one go. This keeps the relevant values in cache instead of discarding and re-acquiring them.
  • Swap the member order in spike_event, reducing its size from 24 to 16 bytes.
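The first, second, and fourth changes can be sketched roughly as below. Type and field names here are hypothetical, chosen for illustration rather than taken from Arbor's actual sources, and the `auto&&` parameter assumes C++20:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using cell_gid = std::uint32_t;
using lid      = std::uint32_t;

// Before: array of structures. A binary search over `source` strides across
// unrelated fields (target, weight, delay), wasting most of each cache line.
struct connection {
    cell_gid source;
    lid      target;
    float    weight;
    float    delay;
};

// After: structure of arrays. The search touches only the densely packed
// `sources` array; the other arrays are read just for the matching range.
struct connection_list {
    std::vector<cell_gid> sources;  // sorted by source gid
    std::vector<lid>      targets;
    std::vector<float>    weights;
    std::vector<float>    delays;

    // One lower_bound instead of equal_range: the second binary search is
    // replaced by a linear scan over the (typically short) run of equal sources.
    void walk_spike(cell_gid spike_source, double spike_time, auto&& emit) const {
        auto beg = sources.begin();
        auto it  = std::lower_bound(beg, sources.end(), spike_source);
        for (; it != sources.end() && *it == spike_source; ++it) {
            auto ix = it - beg;
            emit(targets[ix], weights[ix], spike_time + delays[ix]);
        }
    }
};

// Member order matters for padding: (double, float, uint32) packs into 16 bytes,
// whereas (uint32, double, float) would be padded out to 24.
struct spike_event {
    double time;
    float  weight;
    lid    target;
};
static_assert(sizeof(spike_event) == 16);
```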

Outcome

We get a minor reduction in runtime on my local machine:

❯ hyperfine 'bin/drybench ../example/drybench/params.json'
Benchmark 1: bin/drybench ../example/drybench/params.json
  Time (mean ± σ):      4.225 s ±  0.167 s    [User: 5.939 s, System: 0.397 s]
  Range (min … max):    4.064 s …  4.632 s    10 runs

4.064s vs 4.148s, but this is still within one $\sigma$.

Routes not taken

Using a faster sort

Tried pdqsort, but sorting isn't the real bottleneck, and pdqsort doesn't measurably
improve on our workload anyway.

Using a faster lower_bound

As with sort, no variation of lower_bound improves matters measurably.
We can conclude that enqueuing is bound by the actual pushing of
events into the cells' queues. As noted in #502, L2 cache misses are the probable
root cause.

Building a temporary buffer of events

Tried creating a scratch space to dump all events into, keyed by their queue index;
the scratch buffer is kept around to avoid repeated allocations.
We then sort it by index and build the queues in one go. This proved a significant
slow-down.
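For reference, the rejected scheme looked roughly like this (a sketch with made-up names, not the actual patch):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct event { double time; float weight; };

// Rejected approach (sketch): dump every event into one scratch vector keyed by
// destination queue index, sort once by that key, then append the sorted runs
// to the per-cell queues. The extra sort and copies made this a net slow-down.
inline void build_queues(std::vector<std::pair<std::size_t, event>>& scratch,
                         std::vector<std::vector<event>>& queues) {
    std::stable_sort(scratch.begin(), scratch.end(),
                     [](const auto& a, const auto& b) { return a.first < b.first; });
    for (const auto& [ix, ev]: scratch) queues[ix].push_back(ev);
    scratch.clear();  // keep the capacity so the next epoch allocates nothing
}
```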

SoA-Splitting Spikes

Doesn't improve our main bottleneck, and spike is already pretty small (i.e. not much
waste per cache line).

Multithreading the appends

That would only worsen the cache misses, as L2 is shared between threads.
Also, if cache misses are our problem, this doesn't address the root cause.

Adding a pre-filter to spike processing

Reducing the incoming spikes by tracking all sources terminating at the local process
didn't yield an improvement, even though it rejected ~25% of all incoming events. Instead,
a significant slow-down was observed.
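The rejected pre-filter can be sketched as follows (hypothetical names; not Arbor's actual code). Building and probing the set of local sources cost more than the ~25% of walks it saved:

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

using cell_gid = std::uint32_t;

struct spike { cell_gid source; double time; };

// Rejected approach (sketch): drop spikes whose source has no target on this
// rank before walking the connection table at all.
inline std::vector<spike> prefilter(const std::vector<spike>& incoming,
                                    const std::unordered_set<cell_gid>& local_sources) {
    std::vector<spike> kept;
    kept.reserve(incoming.size());
    for (const auto& s: incoming) {
        if (local_sources.count(s.source)) kept.push_back(s);
    }
    return kept;
}
```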

Juwels Booster

Input Deck and Configuration

  • JuwelsBooster develbooster queue
  • CMake
cmake .. -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DARB_USE_BUNDLED_LIBS=ON -DARB_VECTORIZE=ON -DARB_WITH_PYTHON=OFF -DARB_WITH_MPI=ON -DBUILD_SHARED_LIBS=OFF -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON -DARB_PROFILING=ON -DCMAKE_INSTALL_PREFIX=../install -G Ninja -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DARB_GPU=cuda -DCMAKE_BUILD_TYPE=release
    
  • Input
    {
      "name": "test",
      "num-cells": 8000,
      "duration": 200,
      "min-delay": 10,
      "fan-in": 5000,
      "realtime-ratio": 0.01,
      "spike-frequency": 50,
      "threads": 4,
      "ranks": 10000
    }

Validation

benchmark parameters:
  name:           test
  cells per rank: 8000
  duration:       200 ms
  fan in:         5000 connections/cell
  min delay:      10 ms
  spike freq:     50 Hz
  cell overhead:  0.01 ms to advance 1 ms
expected:
  cell advance:   16 s
  spikes:         800000000
  events:         3204710400
  spikes:         2000 per interval
  events:         151969 per cell per interval
HW resources:
  threads:        4
  ranks:          10000

and

808110000 spikes generated at rate of 10 spikes per cell

Baseline

Summary

---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
model-init                    113.175         288.951
model-run                     107.741        1306.015
meter-total                   220.916        1594.966

Profiler

REGION                          CALLS      WALL     THREAD        %
root                                -    78.541    314.163    100.0
  communication                     -    36.991    147.964     47.1
    enqueue                         -    21.247     84.986     27.1
      sort                     320000    19.952     79.807     25.4
      merge                    320000     1.174      4.695      1.5
      setup                    320000     0.121      0.483      0.2
    walkspikes                     40    15.735     62.940     20.0
    exchange                        -     0.010      0.038      0.0
      sort                         40     0.006      0.022      0.0
      gatherlocal                  40     0.004      0.016      0.0
      gather                       40     0.000      0.000      0.0
        remote                     40     0.000      0.000      0.0
          post_process             40     0.000      0.000      0.0

Perf

 Performance counter stats for 'bin/drybench ../example/drybench/params.json':

         245572.30 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
          29077724      page-faults:u             #  118.408 K/sec
      730063012768      cycles:u                  #    2.973 GHz                      (83.33%)
       25160708320      stalled-cycles-frontend:u #    3.45% frontend cycles idle     (83.33%)
      332882140632      stalled-cycles-backend:u  #   45.60% backend cycles idle      (83.33%)
     1080323993228      instructions:u            #    1.48  insn per cycle
                                                  #    0.31  stalled cycles per insn  (83.33%)
      243594581289      branches:u                #  991.946 M/sec                    (83.33%)
        3447791949      branch-misses:u           #    1.42% of all branches          (83.33%)

     245.609154303 seconds time elapsed

     218.071863000 seconds user
      25.114123000 seconds sys

Feature Branch

Summary

---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
model-init                    112.901         939.580
model-run                      84.730         871.134
meter-total                   197.631        1810.714

Profiler

REGION                              CALLS      WALL     THREAD        %
root                                    -    71.717    286.869    100.0
  communication                         -    30.408    121.633     42.4
    enqueue                             -    20.006     80.023     27.9
      sort                         320000    18.984     75.938     26.5
      merge                        320000     0.937      3.746      1.3
      setup                        320000     0.085      0.340      0.1
    walkspikes                         40    10.401     41.605     14.5
    exchange                            -     0.001      0.005      0.0
      sort                             40     0.001      0.004      0.0
      gatherlocal                      40     0.000      0.001      0.0
      gather                           40     0.000      0.000      0.0
        remote                         40     0.000      0.000      0.0
          post_process                 40     0.000      0.000      0.0
    spikeio                            40     0.000      0.000      0.0

Perf

 Performance counter stats for 'bin/drybench /p/project/cslns/hater1/arbor/example/drybench/params.json':

         221852.22 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
          28014394      page-faults:u             #  126.275 K/sec
      658832257282      cycles:u                  #    2.970 GHz                      (83.33%)
       49927962696      stalled-cycles-frontend:u #    7.58% frontend cycles idle     (83.33%)
      285682904743      stalled-cycles-backend:u  #   43.36% backend cycles idle      (83.33%)
      943000553536      instructions:u            #    1.43  insn per cycle
                                                  #    0.30  stalled cycles per insn  (83.33%)
      212435336483      branches:u                #  957.553 M/sec                    (83.33%)
        3372189010      branch-misses:u           #    1.59% of all branches          (83.33%)

     221.907422663 seconds time elapsed

     195.279197000 seconds user
      24.508514000 seconds sys

Sorting events by time only

w/ pdqsort

---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
model-init                    111.256         939.580
model-run                      79.065         871.150
meter-total                   190.321        1810.730

with util::sort_by

---- meters -------------------------------------------------------------------------------
meter                         time(s)      memory(MB)
-------------------------------------------------------------------------------------------
model-init                    111.666         939.581
model-run                      78.900         871.131
meter-total                   190.565        1810.712

Profiler

REGION                              CALLS      WALL     THREAD        %
root                                    -    68.611    274.442    100.0
  communication                         -    27.576    110.306     40.2
    enqueue                             -    17.728     70.912     25.8
      sort                         320000    16.853     67.410     24.6
      merge                        320000     0.808      3.231      1.2
      setup                        320000     0.068      0.270      0.1
    walkspikes                         40     9.848     39.391     14.4

Conclusion

We find a 30% decrease in time spent on spike walking and a 10% decrease end-to-end. Note that
this case is extremely heavy on the communication part and advances cells at 100x realtime, i.e.
1 ms wallclock for 100 ms of biological time, which is far beyond any cable cell we encounter in the wild.

By sorting on the time field alone -- we don't care about the ordering beyond that -- we gain
another 5%.
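The time-only sort amounts to the following (field names are hypothetical; Arbor's actual comparator may differ):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct spike_event {
    double        time;
    float         weight;
    std::uint32_t target;
};

// A full lexicographic comparator would compare time, then target, then weight.
// For correct delivery only the time ordering matters, so sorting on the single
// `time` key does less work per comparison.
inline void sort_by_time(std::vector<spike_event>& evts) {
    std::sort(evts.begin(), evts.end(),
              [](const spike_event& a, const spike_event& b) { return a.time < b.time; });
}
```

Note that dropping the secondary keys makes the order of same-time events unspecified, which is why reproducibility of the event ordering came up in review.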

Side note: Memory measurement

The memory measurement isn't trustworthy: the 290 MB reported above for the baseline turned into
1300 MB on repeating the benchmark.

Related Issues

Closes #502

@thorstenhater
Contributor Author

Spack fails on matplotlib deprecation style=seaborn

@boeschf We get these spurious (?) failures on daint, is there a way to re-run the jobs?

bcumming
bcumming previously approved these changes Oct 18, 2023
Member

@bcumming bcumming left a comment


This performance issue was always something that I thought would be "nice to have" - it raises its head when you are simulating enormous models on large distributed systems, which was where we wanted to be!

I have one comment about reproducibility and event sorting (I hope that this doesn't impact event performance.)

arbor/simulation.cpp Outdated Show resolved Hide resolved
Member

@bcumming bcumming left a comment


lgtm!

@thorstenhater thorstenhater merged commit 29d84bd into arbor-sim:master Nov 21, 2023
19 checks passed

Successfully merging this pull request may close these issues.

Optimise spike walking