Running Bamba on vLLM #3

Open · 4 tasks
ani300 opened this issue Dec 9, 2024 · 3 comments

Comments

ani300 commented Dec 9, 2024

This issue tracks progress on running Bamba on vLLM.

Success for this issue implies the following:

  • Running the model successfully from the HF checkpoint in vLLM (Add Bamba Model vllm-project/vllm#10909)
  • Ensuring chunked prefill and TP work in vLLM
  • Closing the performance gap in vLLM wrt Llama of similar sizes
  • Reporting the performance results in a blog post

cc @raghukiran1224 @fabianlim @AdnanHoque

fabianlim commented Dec 9, 2024

@ani300 For the TP I have a reasonable fix that I had not yet upstreamed to the main PR (see this); this is now upstreamed.

Also, I realized that the program_ids need to be int64 for large prefill sizes; I had not yet upstreamed that fix either, but this is now upstreamed.
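
For reference, a minimal sketch of the int64 issue in Triton terms, assuming a kernel whose pointer offsets are derived from the program id; the kernel and argument names are illustrative, not the actual chunk_scan_fwd code:

```python
import triton
import triton.language as tl

@triton.jit
def copy_chunk_kernel(x_ptr, out_ptr, n_elements, stride_chunk, CHUNK: tl.constexpr):
    # Cast the program id to int64 so that pid * stride_chunk cannot overflow
    # int32 when the prefill is very long (e.g. 128k tokens) and strides are large.
    pid = tl.program_id(0).to(tl.int64)
    offs = pid * stride_chunk + tl.arange(0, CHUNK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x, mask=mask)
```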

Issues:

  1. NOTE: There is one more issue not yet resolved: after memory profiling, memory_usage_post_profile does not recover back to initial_memory_usage. Unfortunately, this issue does not allow us to run -i 131072 without eager mode (a workaround sketch follows this list).
    -i 131072 -o 1024 -b 1 -t 1 -s 1 -q None -k auto -r 3 -u .92 -e

    INFO 12-08 12:54:53 worker.py:236] Memory profiling results: duration=24.19 seconds, total_gpu_memory=79.15GiB, initial_memory_usage=18.88GiB, peak_torch_memory=69.37GiB, memory_usage_post_profile=48.60GiB, non_torch_memory=0.55GiB, kv_cache_size=5.28GiB, gpu_memory_utilization=0.95.
    

  2. Also, I noticed this in the new vLLM version:

```
INFO 12-12 05:52:22 config.py:403] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
```
  3. FlashInfer backend is not supported.
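
As a rough workaround for issue 1, the same run can be reproduced in offline inference with eager mode enforced (presumably what the -e flag above maps to); the model id and lengths below are assumptions, not taken from the actual benchmark script:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="ibm-ai-platform/Bamba-9B",  # placeholder checkpoint name
    enforce_eager=True,                # skip CUDA graph capture (the -e flag, assumed)
    gpu_memory_utilization=0.92,       # matches -u .92
    max_model_len=131072 + 1024,       # room for the -i 131072 input plus -o 1024 output
    tensor_parallel_size=1,            # matches -t 1
)

outputs = llm.generate(["..."], SamplingParams(max_tokens=1024))
```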

AdnanHoque commented Dec 10, 2024

The chunk_scan_fwd kernel is not H100-optimized and exhibits low compute and memory throughput relative to the peak performance of the hardware, as reported by NCU for bs=1, seq=64k.

[Screenshot: NCU profile of chunk_scan_fwd, bs=1, seq=64k]

chunk_scan_fwd is the most compute-intensive kernel in the prefill stage. Potential Hopper optimizations:

  1. For GMEM -> SMEM transfers, use TMA instead of the cp.async engine

  2. Add a persistent tile scheduler (see the sketch after this list)
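
For item 2, a rough Triton sketch of the persistent-kernel pattern (the same idea as Triton's persistent-matmul tutorial): launch one program per SM and have each program loop over tiles, instead of one program per tile. The element-wise body is purely illustrative and is not the real chunk_scan_fwd logic:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def persistent_tile_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr, NUM_SMS: tl.constexpr):
    start_pid = tl.program_id(0)
    num_tiles = tl.cdiv(n_elements, BLOCK)
    # Persistent loop: each of the NUM_SMS resident programs strides over the tile
    # space, so tile scheduling stays on-chip instead of relaunching blocks.
    for tile_id in range(start_pid, num_tiles, NUM_SMS):
        offs = tile_id * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x * 2.0, mask=mask)

# Launch exactly one program per SM so all programs stay resident.
NUM_SMS = torch.cuda.get_device_properties("cuda").multi_processor_count
x = torch.randn(1 << 20, device="cuda")
y = torch.empty_like(x)
persistent_tile_kernel[(NUM_SMS,)](x, y, x.numel(), BLOCK=1024, NUM_SMS=NUM_SMS)
```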

fabianlim commented Jan 6, 2025

Because we share the unit tests with Bamba, we need an extra use_mamba_kernels: False added to configuration_bamba.py. This only needs to be done on the dev model. @divya-kumari32
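
A minimal sketch of what that flag could look like in configuration_bamba.py, assuming the standard Hugging Face PretrainedConfig pattern; the surrounding arguments are omitted and the default value here is an assumption:

```python
from transformers.configuration_utils import PretrainedConfig

class BambaConfig(PretrainedConfig):
    model_type = "bamba"

    def __init__(self, use_mamba_kernels=False, **kwargs):
        # When False, the model falls back to the pure-PyTorch Mamba path,
        # keeping the shared unit tests independent of the custom fused kernels.
        self.use_mamba_kernels = use_mamba_kernels
        super().__init__(**kwargs)
```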
