The official CUDA kernel implementation for Mixture of Sparse Attention

MoA Kernel

This is the CUDA kernel implementation for MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression.

Installation

We tested the kernel with CUDA 12.4 and PyTorch 2.4. Install the required environment for MoA before installing the kernel.

cd python
FLASHINFER_LOGITS_POST_HOOKS=0 FLASHINFER_HEAD_DIMS=64,128 FLASHINFER_POS_ENCODING_MODES=0 python setup.py install
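Before building, it can help to confirm your toolchain roughly matches the tested versions above. A minimal, standard-library-only sketch of such a check (the version strings are the ones stated in this README; the helper names are illustrative, not part of the kernel's API):

```python
# Compare installed major.minor versions against the versions the kernel
# was tested with (CUDA 12.4, PyTorch 2.4). Purely illustrative: the build
# needs a compatible toolchain, not necessarily these exact versions.

def parse_version(s: str) -> tuple:
    """Turn a dotted version string like '12.4' or '2.4.0' into (major, minor)."""
    return tuple(int(p) for p in s.split(".")[:2])

def meets(installed: str, tested: str) -> bool:
    """True if the installed major.minor is at least the tested major.minor."""
    return parse_version(installed) >= parse_version(tested)

TESTED = {"cuda": "12.4", "torch": "2.4"}

# Example inputs: versions you might read from `nvcc --version` and from
# `python -c "import torch; print(torch.__version__)"`.
print(meets("12.4", TESTED["cuda"]))    # True
print(meets("2.3.1", TESTED["torch"]))  # False
```

The environment variables in the build command restrict which kernel variants FlashInfer compiles (e.g. head dimensions 64 and 128), which keeps compilation time down.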

Quick Test

python accuracy_test.py

Acknowledgement

Our kernel is built upon the FlashInfer project.

TODO

  • support batch size > 1
  • support multi-GPU inference
  • support GQA
