While investigating the crate's performance, I found that enabling parallelism can be highly detrimental to performance.
This only occurs on machines with a lot of cores (and therefore threads).
On a regular desktop (8 cores) I see:
parallelism-8-f32-nnn-gemm-6×2304×768
time: [176.79 µs 182.91 µs 186.61 µs]
change: [-9.5273% -4.5251% -0.1034%] (p = 0.10 > 0.05)
No change in performance detected.
parallelism-none-f32-nnn-gemm-6×2304×768
time: [685.08 µs 686.80 µs 687.87 µs]
change: [-1.0090% -0.5130% -0.1028%] (p = 0.04 < 0.05)
Change within noise threshold.
parallelism-8-f32-nnt-gemm-6×2304×768
time: [433.25 µs 444.28 µs 459.26 µs]
change: [+9.9070% +13.388% +16.764%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
parallelism-none-f32-nnt-gemm-6×2304×768
time: [1.3960 ms 1.4004 ms 1.4051 ms]
change: [+14.439% +15.374% +16.258%] (p = 0.00 < 0.05)
Performance has regressed.
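Which is sort of OK: with parallelism 8 the gemm is indeed ~3.5x faster than with no parallelism (182.91 µs vs 686.80 µs for nnn, 444.28 µs vs 1.4004 ms for nnt), so there is a real speedup.
However, on 48 cores: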
parallelism-48-f32-nnn-gemm-6×2304×768
time: [2.2364 ms 2.2723 ms 2.3164 ms]
Found 2 outliers among 10 measurements (20.00%)
1 (10.00%) low mild
1 (10.00%) high severe
parallelism-none-f32-nnn-gemm-6×2304×768
time: [752.12 µs 752.97 µs 753.81 µs]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) low mild
parallelism-48-f32-nnt-gemm-6×2304×768
time: [2.3022 ms 2.3255 ms 2.3660 ms]
parallelism-none-f32-nnt-gemm-6×2304×768
time: [789.54 µs 789.93 µs 790.39 µs]
There is a big slowdown from over-parallelism: with parallelism 48 the gemm is roughly 3x slower than with no parallelism (2.2723 ms vs 752.97 µs for nnn, 2.3255 ms vs 789.93 µs for nnt).
The flamegraph actually shows this pretty well.
Is there anything we can do to help here?
I'm under the impression that using a simple `par_chunks` instead of `par_iter`, with maybe some length heuristics, could help spawn fewer threads when the matmul is small enough.
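To make the "length heuristics" idea concrete, here is a minimal sketch of what I have in mind. It is not the crate's code: `threads_for_gemm`, `for_each_chunk` and the 2,000,000 multiply-adds-per-thread threshold are all hypothetical names and numbers picked for illustration, and the chunking uses plain `std::thread::scope` as a stand-in for rayon's chunked iterators so the example has no dependencies.

```rust
use std::thread;

/// Hypothetical heuristic (not from the crate): allow one thread per
/// `min_work_per_thread` multiply-adds, never fewer than 1 and never
/// more than `max_threads`.
fn threads_for_gemm(
    m: usize,
    n: usize,
    k: usize,
    max_threads: usize,
    min_work_per_thread: usize,
) -> usize {
    let work = m.saturating_mul(n).saturating_mul(k);
    (work / min_work_per_thread.max(1)).clamp(1, max_threads.max(1))
}

/// Split `out` into at most `n_threads` contiguous chunks and run `job`
/// on each chunk in its own scoped thread. `job` receives the offset of
/// the chunk within `out` and the chunk itself.
fn for_each_chunk<F>(out: &mut [f32], n_threads: usize, job: F)
where
    F: Fn(usize, &mut [f32]) + Sync,
{
    if out.is_empty() {
        return;
    }
    let n_threads = n_threads.max(1);
    // Ceiling division, so that we never produce more than `n_threads` chunks.
    let chunk_len = (out.len() + n_threads - 1) / n_threads;
    thread::scope(|s| {
        for (i, chunk) in out.chunks_mut(chunk_len).enumerate() {
            let job = &job;
            s.spawn(move || job(i * chunk_len, chunk));
        }
    });
}

fn main() {
    // Shape from the benchmark above: 6×2304×768.
    let (m, n, k) = (6, 2304, 768);
    // The threshold is an arbitrary assumption; it would need tuning.
    let n_threads = threads_for_gemm(m, n, k, 48, 2_000_000);
    let mut out = vec![0.0f32; m * n];
    for_each_chunk(&mut out, n_threads, |offset, chunk| {
        for (j, x) in chunk.iter_mut().enumerate() {
            // Placeholder for the real kernel: just record the index.
            *x = (offset + j) as f32;
        }
    });
    println!("using {n_threads} threads for {m}x{n}x{k}");
}
```

With this made-up threshold, the 6×2304×768 case would be capped at a handful of threads even on the 48-core machine, while a large matmul would still use all of them. Whether that actually recovers the lost performance would have to be checked with the bench.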
Here is the bench I added, run on a regular desktop (8 cores) and on a 48-core machine: https://github.com/Narsil/gemm/tree/bench_rayon