Enhance MM Kernel Performance and Coverage for Specific Input Scenarios #405

Open
wants to merge 23 commits into
base: master

@kiddyjinjin (Collaborator) commented Jan 7, 2025

PR Category

Operator

Type of Change

Performance Optimization

Description

This PR improves the matrix multiplication (MM) kernels for specific input scenarios: a column-major matrix B, large K, and very small M and N.

Key Updates:

1. Support Column-Major Input for Matrix B in Tests and Benchmarks

  • Added test and benchmark cases in which the MM input matrix B is column-major.
  • Ensured correctness and performance consistency across different input layouts.
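
To illustrate what "column-major B" means for a kernel, here is a minimal pure-Python sketch, not code from this PR. It assumes B is a logical [K, N] matrix stored in a flat buffer and addressed through a pair of strides; `make_b` and `matmul_strided` are hypothetical names used only for this illustration. A row-major B has strides (N, 1), a column-major B has strides (1, K) (the strides of a transposed view of an [N, K] array), and a stride-aware matmul must return the same result for both.

```python
# Sketch only: a pure-Python stand-in for a strided matmul. All names are hypothetical.

def make_b(k, n, column_major):
    """Return (flat buffer, (stride_k, stride_n)) for a logical [K, N] matrix B."""
    value = lambda i, j: float(i * n + j)  # logical B[i][j]
    if column_major:
        # Elements of one column are contiguous: B[i][j] lives at j * k + i.
        buf = [value(i, j) for j in range(n) for i in range(k)]
        return buf, (1, k)
    # Elements of one row are contiguous: B[i][j] lives at i * n + j.
    buf = [value(i, j) for i in range(k) for j in range(n)]
    return buf, (n, 1)


def matmul_strided(a, b_buf, b_strides, m, k, n):
    """Compute A @ B where A is a list of rows and B is addressed through strides."""
    stride_k, stride_n = b_strides
    return [
        [sum(a[i][p] * b_buf[p * stride_k + q * stride_n] for p in range(k))
         for q in range(n)]
        for i in range(m)
    ]


M, K, N = 2, 3, 4
A = [[float(i + p) for p in range(K)] for i in range(M)]
row_buf, row_strides = make_b(K, N, column_major=False)
col_buf, col_strides = make_b(K, N, column_major=True)
out_row = matmul_strided(A, row_buf, row_strides, M, K, N)
out_col = matmul_strided(A, col_buf, col_strides, M, K, N)
```

The two buffers differ in memory order, but `out_row` and `out_col` are identical. A layout test for the real kernels would check the same property.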

2. Expanded Tuning Configurations for Large-K Scenarios

  • Introduced additional tuning configs to better handle large-K situations.
  • Improved the latency speedup under large-K conditions from 0.5x to 0.85x.

3. Optimized Kernels for Small M and N Inputs

  • Added new kernels specifically optimized for very small M and N inputs.
  • These kernels make full use of the L2 cache for faster memory access and lower latency.

4. Two-Stage Kernel for Large-K Scenarios

  • Implemented a split-K two-stage kernel approach:
    • Stage 1: A split-K kernel computes partial results and stores them in an intermediate [SPLIT_K, M, N] matrix.
    • Stage 2: A merging kernel combines these intermediate results into the final [M, N] output matrix.
  • This approach improves computational efficiency and resource utilization for large-K scenarios.
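
The two stages described above can be modelled in a few lines of pure Python. This is a sketch of the general split-K idea under simple assumptions (K is cut into `SPLIT_K` contiguous slices and the merge is a plain sum), not the Triton kernels added in this PR. The function names are hypothetical.

```python
# Sketch only: a pure-Python model of a two-stage split-K matmul. All names are hypothetical.

def splitk_stage1(a, b, split_k):
    """Stage 1: each of the split_k slices of K yields one partial [M, N] product."""
    m, k, n = len(a), len(b), len(b[0])
    chunk = (k + split_k - 1) // split_k  # ceil(K / SPLIT_K)
    partials = []
    for s in range(split_k):
        lo, hi = s * chunk, min((s + 1) * chunk, k)
        partials.append([
            [sum(a[i][p] * b[p][j] for p in range(lo, hi)) for j in range(n)]
            for i in range(m)
        ])
    return partials  # shape [SPLIT_K, M, N]


def splitk_stage2(partials):
    """Stage 2: reduce over the SPLIT_K axis to get the final [M, N] output."""
    m, n = len(partials[0]), len(partials[0][0])
    return [[sum(part[i][j] for part in partials) for j in range(n)] for i in range(m)]


def matmul_reference(a, b):
    """Plain matmul used to check the split-K result."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


M, K, N, SPLIT_K = 2, 10, 3, 4
A = [[i + p for p in range(K)] for i in range(M)]
B = [[p * N + j for j in range(N)] for p in range(K)]
partials = splitk_stage1(A, B, SPLIT_K)
C = splitk_stage2(partials)
```

On a GPU, each K slice in stage 1 would typically run in its own set of programs, so a small [M, N] output with a large K still exposes `SPLIT_K` times more parallel work. Stage 2 is then a cheap reduction over the intermediate buffer.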

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by a UT.

Performance

Large-K Performance with Matrix B as Column-Major

1. Large-K with Matrix B as Column-Major (float16)
  • Base Performance:
    image
  • Current Performance:
    image

Small M and N Matrix

1. Small M and N (float16)
  • Base Performance:
    image
  • Current Performance:
    • b_row_major
      image
    • b_column_major
      image
2. Small M and N (float32)
  • Base Performance:
    image
  • Current Performance:
    • b_row_major
      image
    • b_column_major
      image
3. Small M and N (bfloat16)
  • Base Performance:
    image
  • Current Performance:
    • b_row_major
      image
    • b_column_major
      image

@iclementine self-assigned this Jan 10, 2025