Running Bamba on vLLM #3

Open · 4 tasks
ani300 opened this issue Dec 9, 2024 · 3 comments

Comments

ani300 commented Dec 9, 2024

This issue tracks progress on running Bamba on vLLM.

Success for this issue implies the following:

  • Running the model successfully from the HF checkpoint in vLLM (Add Bamba Model vllm-project/vllm#10909)
  • Ensuring chunked prefill and TP work in vLLM
  • Closing the performance gap in vLLM wrt Llama of similar sizes
  • Reporting the performance results in a blog post

cc @raghukiran1224 @fabianlim @AdnanHoque

fabianlim commented Dec 9, 2024

@ani300 For the TP I have a reasonable fix that I had not yet upstreamed to the main PR (see this); this is now upstreamed.

Also, I realized that the program_ids need to be int64 for large prefill sizes; I had not yet upstreamed that fix either, but this is now upstreamed.
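
For reference, a minimal sketch of the int64 issue in Triton terms, assuming a kernel whose pointer offsets are derived from the program id; the kernel and argument names are illustrative, not the actual chunk_scan_fwd code:

```python
import triton
import triton.language as tl

@triton.jit
def copy_chunk_kernel(x_ptr, out_ptr, n_elements, stride_chunk, CHUNK: tl.constexpr):
    # Cast the program id to int64 so that pid * stride_chunk cannot overflow
    # int32 when the prefill is very long (e.g. 128k tokens) and strides are large.
    pid = tl.program_id(0).to(tl.int64)
    offs = pid * stride_chunk + tl.arange(0, CHUNK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x, mask=mask)
```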

Issues:

  1. NOTE: There is one more issue not yet resolved: after memory profiling, memory_usage_post_profile does not recover back to initial_memory_usage. Unfortunately, this issue does not allow us to run -i 131072 without eager mode (a workaround sketch follows this list).
    -i 131072 -o 1024 -b 1 -t 1 -s 1 -q None -k auto -r 3 -u .92 -e

    INFO 12-08 12:54:53 worker.py:236] Memory profiling results: duration=24.19 seconds, total_gpu_memory=79.15GiB, initial_memory_usage=18.88GiB, peak_torch_memory=69.37GiB, memory_usage_post_profile=48.60GiB, non_torch_memory=0.55GiB, kv_cache_size=5.28GiB, gpu_memory_utilization=0.95.
    

  2. Also, I noticed this in the new vLLM version:

```
INFO 12-12 05:52:22 config.py:403] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
```
  3. FlashInfer backend is not supported.
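
As a rough workaround for issue 1, the same run can be reproduced in offline inference with eager mode enforced (presumably what the -e flag above maps to); the model id and lengths below are assumptions, not taken from the actual benchmark script:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="ibm-ai-platform/Bamba-9B",  # placeholder checkpoint name
    enforce_eager=True,                # skip CUDA graph capture (the -e flag, assumed)
    gpu_memory_utilization=0.92,       # matches -u .92
    max_model_len=131072 + 1024,       # room for the -i 131072 input plus -o 1024 output
    tensor_parallel_size=1,            # matches -t 1
)

outputs = llm.generate(["..."], SamplingParams(max_tokens=1024))
```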

AdnanHoque commented Dec 10, 2024

The chunk_scan_fwd kernel is not H100-optimized and exhibits low compute and memory throughput relative to the peak performance of the hardware, as reported by NCU for bs=1, seq=64k.

[Screenshot: NCU profile of chunk_scan_fwd, bs=1, seq=64k]

chunk_scan_fwd is the most compute-intensive kernel in the prefill stage. Potential Hopper optimizations:

  1. For GMEM -> SMEM transfers, use TMA instead of the cp.async engine

  2. Add a persistent tile scheduler (see the sketch after this list)
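
For item 2, a rough Triton sketch of the persistent-kernel pattern (the same idea as Triton's persistent-matmul tutorial): launch one program per SM and have each program loop over tiles, instead of one program per tile. The element-wise body is purely illustrative and is not the real chunk_scan_fwd logic:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def persistent_tile_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr, NUM_SMS: tl.constexpr):
    start_pid = tl.program_id(0)
    num_tiles = tl.cdiv(n_elements, BLOCK)
    # Persistent loop: each of the NUM_SMS resident programs strides over the tile
    # space, so tile scheduling stays on-chip instead of relaunching blocks.
    for tile_id in range(start_pid, num_tiles, NUM_SMS):
        offs = tile_id * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x * 2.0, mask=mask)

# Launch exactly one program per SM so all programs stay resident.
NUM_SMS = torch.cuda.get_device_properties("cuda").multi_processor_count
x = torch.randn(1 << 20, device="cuda")
y = torch.empty_like(x)
persistent_tile_kernel[(NUM_SMS,)](x, y, x.numel(), BLOCK=1024, NUM_SMS=NUM_SMS)
```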

fabianlim commented Jan 6, 2025

Because we share the unit tests with Bamba, we need an extra use_mamba_kernels: False added to configuration_bamba.py. This only needs to be done on the dev model. @divya-kumari32
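
A minimal sketch of what that flag could look like in configuration_bamba.py, assuming the standard Hugging Face PretrainedConfig pattern; the surrounding arguments are omitted and the default value here is an assumption:

```python
from transformers.configuration_utils import PretrainedConfig

class BambaConfig(PretrainedConfig):
    model_type = "bamba"

    def __init__(self, use_mamba_kernels=False, **kwargs):
        # When False, the model falls back to the pure-PyTorch Mamba path,
        # keeping the shared unit tests independent of the custom fused kernels.
        self.use_mamba_kernels = use_mamba_kernels
        super().__init__(**kwargs)
```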
