[Kernel][Triton][AMD] Change default block size for triton_scaled_mm to 128 for 3-5x speedup #11698

rasmith · 2025-01-03T00:31:01Z

Changed the default block size for triton_scaled_mm from 32x32x32 to 128x128x128 for better performance. This results in a roughly 3-5x speedup.

python benchmarks/benchmark_latency.py --dtype bfloat16 --enable-chunked-prefill False --load-format dummy --batch-size 64 --num-iters-warmup 2 --num-iters 5 --input-len 2048 --output-len 128 --model /models/Phi-3-medium-128k-instruct-quantized.w8a8/

Before:

Avg latency: 14.48 seconds

After:

Avg latency: 5.52 seconds

python benchmarks/benchmark_throughput.py --dtype bfloat16 --enable-chunked-prefill False --load-format dummy --input-len 2048 --output-len 128 --model /models/Phi-3-medium-128k-instruct-quantized.w8a8/

Before:

Throughput: 10269.32 tok/s

After:

Throughput: 31150.8 tokens/s
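For reference, the end-to-end ratios implied by the two benchmark runs above can be checked with a few lines of arithmetic. This is only a sketch over the numbers quoted in this description; the variable names are illustrative and not part of the benchmark scripts.

```python
# Numbers copied from the benchmark output quoted above.
latency_before_s = 14.48       # benchmark_latency.py, 32x32x32 blocks
latency_after_s = 5.52         # benchmark_latency.py, 128x128x128 blocks
throughput_before = 10269.32   # benchmark_throughput.py, tokens/s
throughput_after = 31150.8     # benchmark_throughput.py, tokens/s

# Lower latency is better, so the ratio is before / after.
latency_speedup = latency_before_s / latency_after_s
# Higher throughput is better, so the ratio is after / before.
throughput_speedup = throughput_after / throughput_before

print(f"latency speedup:    {latency_speedup:.2f}x")
print(f"throughput speedup: {throughput_speedup:.2f}x")
```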

Signed-off-by: Randall Smith <[email protected]>

github-actions · 2025-01-03T00:31:13Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

- Add ready label to the PR
- Enable auto-merge.

🚀

mgoin · 2025-01-03T01:03:24Z

This is an impressive improvement! Could you also show comparisons for workloads with equal input and output lengths, preferably at a low batch size? This change could regress the TPOT for small decode batches.

It seems there is no tuning for this kernel at the moment, so maybe it could benefit from a simple heuristic for the extreme problem sizes or a few @triton.autotune configs for the block sizes.
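To make the "simple heuristic" idea concrete, here is a minimal sketch of what shape-based block-size selection could look like. Everything in it is an assumption for illustration: `pick_block_sizes`, `num_tiles`, and the 32/128 bounds are hypothetical and are not taken from the vLLM kernel or from any heuristic added later in this PR. The idea is to clamp each block dimension to the next power of two of the corresponding problem dimension, so that a small decode batch (small M) does not pay for a 128-row tile.

```python
def next_power_of_2(x: int) -> int:
    """Smallest power of two that is >= x (returns 1 for x <= 1)."""
    return 1 if x <= 1 else 1 << (x - 1).bit_length()


def cdiv(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return (a + b - 1) // b


def pick_block_sizes(M: int, N: int, K: int,
                     min_block: int = 32, max_block: int = 128):
    """Hypothetical heuristic: power-of-two blocks clamped to [min_block, max_block].

    The bounds are illustrative assumptions, not tuned values.
    """
    def clamp(dim: int) -> int:
        return min(max_block, max(min_block, next_power_of_2(dim)))

    return clamp(M), clamp(N), clamp(K)


def num_tiles(M: int, N: int, block_m: int, block_n: int) -> int:
    """Number of output tiles (program instances) for an M x N result."""
    return cdiv(M, block_m) * cdiv(N, block_n)


# Large prefill-like shape: picks the 128x128x128 default from this PR.
print(pick_block_sizes(8192, 8192, 8192))
# Single-token decode: M is tiny, so the M block shrinks to the minimum.
print(pick_block_sizes(1, 4096, 4096))
# Tile counts for a 2048 x 4096 output with 32x32 versus 128x128 blocks.
print(num_tiles(2048, 4096, 32, 32), num_tiles(2048, 4096, 128, 128))
```

Whether smaller M blocks actually help TPOT would need to be measured with the low-batch benchmarks requested above; the sketch only shows the shape of such a dispatch function.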

Signed-off-by: Randall Smith <[email protected]>

Change default block size for triton_scaled_mm to 128 for 4-5x speedup

5675c6b

Signed-off-by: Randall Smith <[email protected]>

Use heuristic based on cutlass_gemm_sm90_int8_dispatch

a45f569

Signed-off-by: Randall Smith <[email protected]>
