This improves drastically overthreading issue (>48cores) #11

Narsil · 2023-07-20T07:51:05Z

I'm not sure that this change is optimal by any means.

But it does yield a significant improvement when running a relatively small matmul on a 48-core machine.

Before:

// 48 cores
parallelism-48-f32-nnn-gemm-6×2304×768
                        time:   [2.2215 ms 2.2584 ms 2.2906 ms]
                        change: [-2.7095% -0.4486% +2.0755%] (p = 0.74 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

parallelism-none-f32-nnn-gemm-6×2304×768
                        time:   [745.65 µs 746.78 µs 748.09 µs]
                        change: [-0.6916% -0.4244% -0.1303%] (p = 0.02 < 0.05)
                        Change within noise threshold.

After:

parallelism-48-f32-nnn-gemm-6×2304×768
                        time:   [641.83 µs 651.66 µs 664.90 µs]
                        change: [-71.903% -71.301% -70.685%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe

parallelism-none-f32-nnn-gemm-6×2304×768
                        time:   [741.77 µs 744.47 µs 748.39 µs]
                        change: [-0.9774% -0.6209% -0.2174%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

For the non-parallel case, at least we're not slowing down drastically (but this is not an improvement either).


sarah-quinones · 2023-07-20T08:04:34Z

this formula here seems pretty cryptic, is there some reasoning behind it?

 let n_threads = std::cmp::max(1, std::cmp::min(max_threads, (total_work - threading_threshold + 1) / threading_threshold));

Narsil · 2023-07-20T13:05:10Z

threading_threshold is the threshold you already had before, used to choose between num_threads=1 and num_threads=all.

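(total_work - threading_threshold + 1) / threading_threshold
is meant as roughly ceil(total_work / threading_threshold), a heuristic for how many threads it looks reasonable to share the work across. (Strictly, with integer division it evaluates to floor((total_work + 1) / threading_threshold) - 1; the usual integer ceiling is (total_work + threading_threshold - 1) / threading_threshold.)
min(max_threads, X) is to not use more threads than requested.
max(1, X) is to use at least 1.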
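
To make the heuristic easier to reason about, here is a minimal sketch that wraps the expression from this PR in a function, next to a hypothetical variant that uses the usual integer ceiling. The function names are invented for illustration and are not part of the crate, the threshold is assumed to be non-zero, and `saturating_sub` is an addition made here so that inputs below the threshold do not underflow `usize`.

```rust
/// The expression from this PR, wrapped in a function.
/// Assumption: `threading_threshold > 0`. `saturating_sub` is added here
/// (it is not in the PR) so that `total_work < threading_threshold`
/// does not underflow.
pub fn n_threads_pr(total_work: usize, threading_threshold: usize, max_threads: usize) -> usize {
    let scaled = (total_work.saturating_sub(threading_threshold) + 1) / threading_threshold;
    // Use at least 1 thread and never more than the caller requested.
    std::cmp::max(1, std::cmp::min(max_threads, scaled))
}

/// Hypothetical variant using a true integer ceiling,
/// ceil(total_work / threading_threshold), for comparison only.
pub fn n_threads_ceil(total_work: usize, threading_threshold: usize, max_threads: usize) -> usize {
    let scaled = (total_work + threading_threshold - 1) / threading_threshold;
    std::cmp::max(1, std::cmp::min(max_threads, scaled))
}
```

With an illustrative threshold of 1000 and 48 threads available, `n_threads_pr` gives 1 thread for 2000 units of work and 9 threads for 10000, while `n_threads_ceil` gives 2 and 10. Both saturate at 48 for large inputs, so the two differ only in how quickly small problems start using extra threads.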

Narsil added 3 commits July 16, 2023 18:39

Adding a parallelism bench.

8d205b6

Fixing large multi-threading (-40% improvement for parallelism benchmark)

0da2128

Fix.

c1a5b31

Narsil added 2 commits July 26, 2023 13:35

Fix tests.

76ea6bd

Format.

b11ea6f

Narsil force-pushed the bench_rayon branch from d0c85b1 to b11ea6f July 26, 2023 11:35
