Enhance MM Kernel Performance and Coverage for Specific Input Scenarios #405

kiddyjinjin · 2025-01-07T05:32:26Z

PR Category

Operator

Type of Change

Performance Optimization

Description

This PR improves the matrix multiplication (MM) kernels for specific input scenarios: a column-major matrix B, large K, and very small M and N.

Key Updates:

1. Support Column-Major Input for Matrix B in Tests and Benchmarks

Added test and benchmark cases in which the MM input matrix B is column-major.
Verified correctness and consistent performance across input layouts.
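To make the layout distinction concrete, here is a minimal pure-Python sketch, not the PR's test code. It stores the same logical K x N matrix B once row-major and once column-major, then multiplies through strides only. The helper names `make_b` and `mm_strided` are hypothetical and exist only for this illustration.

```python
def make_b(K, N, column_major):
    """Return (flat buffer, stride_k, stride_n) for a logical K x N matrix B."""
    # Logical value of B[k][n]; any deterministic formula works for a layout check.
    value = lambda k, n: float(k * N + n + 1)
    if column_major:
        # Column-major: one column is contiguous in memory, so stride_k == 1.
        buf = [value(k, n) for n in range(N) for k in range(K)]
        return buf, 1, K
    # Row-major: one row is contiguous in memory, so stride_n == 1.
    buf = [value(k, n) for k in range(K) for n in range(N)]
    return buf, N, 1


def mm_strided(a, M, K, b_buf, stride_bk, stride_bn, N):
    """Reference C = A @ B where B is addressed only through its strides."""
    c = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for k in range(K):
                acc += a[m][k] * b_buf[k * stride_bk + n * stride_bn]
            c[m][n] = acc
    return c


M, K, N = 2, 3, 4
a = [[float(m + k) for k in range(K)] for m in range(M)]
c_row = mm_strided(a, M, K, *make_b(K, N, column_major=False), N)
c_col = mm_strided(a, M, K, *make_b(K, N, column_major=True), N)
assert c_row == c_col  # the layout changes the addressing, not the result
```

In PyTorch, a column-major B of logical shape (K, N) is commonly obtained by transposing a contiguous (N, K) tensor, for example `torch.randn(N, K).t()`, which yields strides (1, K). Whether the PR's tests construct B this way is an assumption.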

2. Expanded Tuning Configurations for Large-K Scenarios

Introduced additional tuning configs to better handle large-K situations.
Improved the latency speedup under large-K conditions from 0.5x to 0.85x.

3. Optimized Kernels for Small M and N Inputs

Added new kernels specifically optimized for very small M and N inputs.
These kernels fully utilize L2 cache for faster memory access and reduced latency.
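As a rough illustration of why small M and N inputs suit a cache-resident approach, the sketch below estimates the bytes touched by one MM call and compares them with an L2 capacity. The 40 MiB figure and the helper names are assumptions made for this example. They are not values or functions from this PR.

```python
# Element sizes in bytes for the dtypes benchmarked in this PR.
ELEM_BYTES = {"float16": 2, "bfloat16": 2, "float32": 4}

# Hypothetical L2 capacity; substitute the value for the target GPU.
L2_BYTES = 40 * 1024 * 1024


def mm_footprint_bytes(M, N, K, dtype):
    """Bytes touched by C = A @ B, with A: M x K, B: K x N, C: M x N."""
    return (M * K + K * N + M * N) * ELEM_BYTES[dtype]


def fits_in_l2(M, N, K, dtype, l2_bytes=L2_BYTES):
    """True if all three operands fit in the assumed L2 capacity at once."""
    return mm_footprint_bytes(M, N, K, dtype) <= l2_bytes


# Small M and N with a moderately large K: the whole problem is only ~256 KiB.
print(mm_footprint_bytes(16, 16, 4096, "float16"))  # 262656
print(fits_in_l2(16, 16, 4096, "float16"))          # True
```

This estimate ignores cache-line granularity and other resident data, so it is only a first-order check.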

4. Two-Stage Kernel for Large-K Scenarios

Implemented a split-K two-stage kernel approach:
- Stage 1: A split-K kernel computes partial results and stores them in an intermediate [SPLIT_K, M, N] matrix.
- Stage 2: A merging kernel combines these intermediate results into the final [M, N] output matrix.
This approach improves computational efficiency and resource utilization for large-K scenarios.
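The two stages can be sketched as a pure-Python reference. This is not the Triton kernel from the PR: the function names, the even ceil-divided partition of K, and the plain summation in stage 2 are illustrative assumptions.

```python
def split_k_stage1(a, b, M, N, K, split_k):
    """Stage 1: each split reduces its own K-slice into partial[s][m][n]."""
    chunk = (K + split_k - 1) // split_k  # ceil-divide so every k is covered
    partial = [[[0.0] * N for _ in range(M)] for _ in range(split_k)]
    for s in range(split_k):
        k_lo, k_hi = s * chunk, min((s + 1) * chunk, K)
        for m in range(M):
            for n in range(N):
                acc = 0.0
                for k in range(k_lo, k_hi):  # empty if this split has no work
                    acc += a[m][k] * b[k][n]
                partial[s][m][n] = acc
    return partial  # shape [SPLIT_K, M, N]


def split_k_stage2(partial, M, N):
    """Stage 2: sum over the SPLIT_K axis to produce the final [M, N] output."""
    return [[sum(p[m][n] for p in partial) for n in range(N)] for m in range(M)]


M, N, K, SPLIT_K = 2, 2, 10, 4
a = [[float(m + k) for k in range(K)] for m in range(M)]
b = [[float(k - n) for n in range(N)] for k in range(K)]

partial = split_k_stage1(a, b, M, N, K, SPLIT_K)
c = split_k_stage2(partial, M, N)

# Direct single-pass matmul for comparison (values are small integers, so exact).
expected = [[sum(a[m][k] * b[k][n] for k in range(K)) for n in range(N)]
            for m in range(M)]
assert c == expected
```

On a GPU, each of the SPLIT_K slices in stage 1 can run as an independent program, which exposes parallelism along K when M and N alone provide too few tiles. The cost is the extra [SPLIT_K, M, N] buffer and the second pass that merges it.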

Progress

Change is properly reviewed (1 reviewer required, 2 recommended).
Change responds to an issue.
Change is fully covered by a UT.

Performance

Large-K Performance with Matrix B as Column-Major

1. Large-K with Matrix B as Column-Major (`float16`)

Base Performance:
Current Performance:

Small M and N Matrix

1. Small M and N (`float16`)

Base Performance:
Current Performance:
b_row_major
b_column_major

1. Small M and N (`float32`)

Base Performance:
Current Performance:
b_row_major
b_column_major

1. Small M and N (`bfloat16`)

Base Performance:
Current Performance:
b_row_major
b_column_major

kiddyjinjin and others added 16 commits November 27, 2024 01:40

Remove INT_DTYPES from isfinite operation in benchmark and tests

19a6207

Merge branch 'FlagOpen:master' into master

3a94a83

add split-k mm implementation

57c1e92

merge upstream

3f5a292

update split-k mm

f92dfd7

update benchmark for blas & fix outer bench bug

b1b849b

update benchmark for blas

0f008da

merge upstream/master

1b9e0f9

update mm for general mm

fbc584b

update largek mm

410e065

update largek mm

faa8b19

update group merge kernel

7b31139

update group merge kernel

43bf8c9

update group merge kernel

d68835f

update group merge kernel

51f824b

update io-bound kernel

62f2e0f

iclementine self-assigned this Jan 10, 2025

kiddyjinjin added 7 commits January 13, 2025 06:01

fix illegal memory access bug

7aa9846

add even_k support for largek

3822285

add streamk version

9183964

add streamk experimental version

6d9c406

Merge remote-tracking branch 'upstream/master'

2199151

fix wrap bug

fe6aec2

add even_k for streamk_mm

5d1bf96
