📊 Improve spike delivery #2222
Conversation
Spack fails on matplotlib deprecation

@boeschf We get these spurious (?) failures on daint, is there a way to re-run the jobs?
This performance issue was always something that I thought would be "nice to have" - it raises its head when you are simulating enormous models on large distributed systems, which was where we wanted to be!
I have one comment about reproducibility and event sorting (I hope that this doesn't impact event performance.)
lgtm!
Introduction
Issue #502 claims that under certain circumstances, spike walking can become a bottleneck.
L2 cache misses are posited as the root cause. The setup required for this is pretty extreme
though.
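For context, this is roughly the pattern in question, as a minimal illustrative sketch rather than Arbor's actual code: for each incoming spike we locate the connections originating from its source in a sorted connection table and push one event per connection into the target cell's queue. The scattered pushes into many per-cell queues are the cache-unfriendly part.

```cpp
// Illustrative sketch only; names and types are not Arbor's actual implementation.
#include <algorithm>
#include <cstdint>
#include <vector>

struct spike       { std::uint32_t source; double time; };
struct connection  { std::uint32_t source; std::uint32_t target; float weight; float delay; };
struct spike_event { std::uint32_t target; double time; float weight; };

// `connections` is sorted by source gid; `queues` holds one event queue per local cell.
void walk_spikes(const std::vector<spike>& spikes,
                 const std::vector<connection>& connections,
                 std::vector<std::vector<spike_event>>& queues) {
    for (const auto& s: spikes) {
        // Binary search for the run of connections originating at this spike's source.
        auto [lo, hi] = std::equal_range(
            connections.begin(), connections.end(), s,
            [](const auto& a, const auto& b) { return a.source < b.source; });
        for (auto it = lo; it != hi; ++it) {
            // Scattered pushes into many per-cell queues: the cache-unfriendly step.
            queues[it->target].push_back({it->target, s.time + it->delay, it->weight});
        }
    }
}
```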
Baseline
Selected to show the problem while being feasible to run on a laptop.
Input
Timing
Changes
- Use `lower_bound` instead of `equal_range`. Removes one binary search, which is cache-unfriendly.
- `spike_event` shrinks from 24 to 16 bytes (see the sketch below).
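A sketch of both changes under stated assumptions; the exact 16-byte layout and the helper below are illustrative, not necessarily what this branch does:

```cpp
// Sketch of the two changes; field layout and names are illustrative.
#include <algorithm>
#include <cstdint>
#include <vector>

// 16 bytes instead of 24, e.g. a 32-bit local target id, 32-bit weight, 64-bit time.
// The layout actually chosen by the PR may differ.
struct spike_event { std::uint32_t target; float weight; double time; };
static_assert(sizeof(spike_event) == 16);

struct connection { std::uint32_t source; std::uint32_t target; float weight; float delay; };

// One binary search (lower_bound) instead of two (equal_range), followed by a
// linear scan over the usually short run of connections sharing the source gid.
template <typename F>
void for_each_connection_from(const std::vector<connection>& connections,
                              std::uint32_t source, F&& f) {
    auto it = std::lower_bound(
        connections.begin(), connections.end(), source,
        [](const connection& c, std::uint32_t src) { return c.source < src; });
    for (; it != connections.end() && it->source == source; ++it) f(*it);
}
```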
Outcome
We get some minor reduction in runtime on my local machine, 4.064s vs 4.148s, but this is still within one $\sigma$.
Routes not taken
Using a faster `sort`
Tried `pdqsort`, but sorting isn't a real bottleneck and/or `pdq` doesn't improve on our problem.
Using a faster `lower_bound`
Similarly to `sort`, no variation of `lower_bound` improves measurably. We can conclude that the enqueuing is bound by the actual pushing of events into cells' queues. As noted in #502, L2 cache misses are a probable root cause.
Building a temporary buffer of events
Tried to create a scratch space to dump all events into, keyed by their queue index. To avoid allocations, the scratch buffer is kept alive between calls. Then, we sort this by index and build the queues in one go. This proved a significant slow-down.
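A rough sketch of this rejected approach; the tagging by queue index and the reuse of the buffer are assumptions drawn from the description above, and the names are illustrative:

```cpp
// Sketch of the rejected scratch-buffer approach; names are illustrative.
#include <algorithm>
#include <cstdint>
#include <vector>

struct spike_event { std::uint32_t target; float weight; double time; };

struct tagged_event {
    std::uint32_t queue_index;  // which per-cell queue this event belongs to
    spike_event   event;
};

// `scratch` is a long-lived buffer, reused between epochs to avoid allocations.
void deliver(std::vector<tagged_event>& scratch,
             std::vector<std::vector<spike_event>>& queues) {
    // Group events by destination queue ...
    std::sort(scratch.begin(), scratch.end(),
              [](const tagged_event& a, const tagged_event& b) {
                  return a.queue_index < b.queue_index;
              });
    // ... then append each run of events to its queue in one pass.
    for (const auto& t: scratch) queues[t.queue_index].push_back(t.event);
    scratch.clear();  // keep capacity for the next epoch
}
```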
SoA-Splitting Spikes
Doesn't improve our main bottleneck, and `spike` is pretty small already (= not much waste on a cache line).
Multithreading the appends
That would only increase the number of cache misses, as L2 is shared between threads. Also, if cache misses are our problem, this won't address the root cause.
Adding a pre-filter to spike processing
Reducing the incoming spikes by tracking all sources that terminate at the local process didn't yield an improvement, despite rejecting ~25% of all incoming events. Instead, a significant slow-down was observed.
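A sketch of what such a pre-filter might look like, assuming a hash set of all source gids that terminate locally; the names are illustrative:

```cpp
// Sketch of the rejected pre-filter; names are illustrative.
#include <cstdint>
#include <unordered_set>
#include <vector>

struct spike { std::uint32_t source; double time; };

// local_sources holds every source gid that has at least one target on this rank.
std::vector<spike> prefilter(const std::vector<spike>& incoming,
                             const std::unordered_set<std::uint32_t>& local_sources) {
    std::vector<spike> kept;
    kept.reserve(incoming.size());
    for (const auto& s: incoming) {
        // Drop spikes that cannot produce an event on this rank (~25% here),
        // at the cost of one hash lookup per spike.
        if (local_sources.count(s.source)) kept.push_back(s);
    }
    return kept;
}
```

One plausible reading of the observed slow-down is that the per-spike hash lookup and the extra copy cost more than skipping ~25% of the spikes saves.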
Juwels Booster
Input Deck and Configuration
Validation
Baseline
Summary
Profiler
Perf
Feature Branch
Summary
Profiler
Perf
Sorting events by `time` only
w/ `pdqsort`
with `util::sort_by`
Profiler
Conclusion
We find a 30% decrease in time spent on spike walking and a 10% decrease end-to-end. Note that this case is extremely heavy on the communication part and advances cells at 100x realtime, i.e. 1 ms wallclock for 100 ms biological time, which is far beyond any cable cell we encounter in the wild. By just sorting on the `time` field -- we don't care about the ordering beyond that -- we find another 5%.
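As an illustration of that last point, sorting on `time` alone versus a full lexicographic ordering might look as follows; the field names and the original comparator are assumptions, not the branch's actual code:

```cpp
// Sketch: relax the event ordering to the time field only.
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

struct spike_event { std::uint32_t target; float weight; double time; };

void sort_events(std::vector<spike_event>& events) {
    // A full ordering on (time, target, weight) is stricter than delivery requires:
    // std::sort(events.begin(), events.end(),
    //           [](const spike_event& a, const spike_event& b) {
    //               return std::tie(a.time, a.target, a.weight)
    //                    < std::tie(b.time, b.target, b.weight);
    //           });
    // Delivery only needs events in time order, so compare time alone:
    std::sort(events.begin(), events.end(),
              [](const spike_event& a, const spike_event& b) { return a.time < b.time; });
}
```

Note that an unstable sort then leaves same-time events in an unspecified order, which is presumably what the review comment above about reproducibility refers to.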
Side note: Memory measurement
The memory measurement isn't trustworthy: the 290 MB reported above for the baseline turned into 1300 MB on repeating the benchmark.
Related Issues
Closes #502