[Fix] fix_vllm_moe_quant #341

lihaoyang-amd · 2024-12-20T04:05:20Z

Detail:
The Alibaba benchmark test script quantizes a float model using vllm by passing the parameters --quantization fp8 and --kv_cache_dtype fp8.
It does not use a Quark-quantized model.
When running the benchmark test for Mixtral 8x7B on 'moe_final_v0.6.0_Nov19', we got garbage output.
After checking, we found the cause was missing code in vllm/model_executor/layers/quantization/fp8.py: for MoE models, the functions fuse_shuffle and moe_padding were not executed when vllm itself quantized the model. When an already-quantized model is passed to vllm, both functions are executed correctly without errors.
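The control-flow problem described above can be illustrated with a toy sketch. This is not the code from fp8.py: `ToyMoELayer`, `quantize_to_fp8`, `shuffle_weights`, `pad_weights`, and both `process_weights_*` functions are hypothetical names, and the early return in the buggy variant is only an assumption about how the two steps came to be skipped. The sketch records which steps run on each path rather than touching real tensors.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMoELayer:
    """Hypothetical stand-in for an MoE layer; it only records applied steps."""
    is_checkpoint_fp8: bool  # True if the checkpoint was already quantized
    applied: list = field(default_factory=list)


def quantize_to_fp8(layer: ToyMoELayer) -> None:
    # Stand-in for quantizing float weights inside vllm (--quantization fp8).
    layer.applied.append("quantize")


def shuffle_weights(layer: ToyMoELayer) -> None:
    # Stand-in for the fuse_shuffle step named in the description.
    layer.applied.append("shuffle")


def pad_weights(layer: ToyMoELayer) -> None:
    # Stand-in for the moe_padding step named in the description.
    layer.applied.append("padding")


def process_weights_buggy(layer: ToyMoELayer) -> ToyMoELayer:
    # Assumed shape of the bug: the float-model path returns early,
    # so the layout steps only run for pre-quantized checkpoints.
    if not layer.is_checkpoint_fp8:
        quantize_to_fp8(layer)
        return layer
    shuffle_weights(layer)
    pad_weights(layer)
    return layer


def process_weights_fixed(layer: ToyMoELayer) -> ToyMoELayer:
    # Sketch of the fix: quantize if needed, then always run the
    # shuffle and padding steps, whichever path produced the weights.
    if not layer.is_checkpoint_fp8:
        quantize_to_fp8(layer)
    shuffle_weights(layer)
    pad_weights(layer)
    return layer


if __name__ == "__main__":
    print(process_weights_buggy(ToyMoELayer(is_checkpoint_fp8=False)).applied)
    print(process_weights_fixed(ToyMoELayer(is_checkpoint_fp8=False)).applied)
```

In this toy model, kernels that expect shuffled and padded weights would read the wrong layout on the buggy float-model path, which is consistent with the garbage output reported above.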

fix_vllm_quant

8ba3f0c

lihaoyang-amd marked this pull request as ready for review December 20, 2024 04:06

lihaoyang-amd requested a review from gshtras December 20, 2024 04:07

lihaoyang-amd changed the title ~~[fix] fix_vllm_quant~~ [Fix] fix_vllm_moe_quant Dec 20, 2024

lihaoyang-amd closed this Dec 20, 2024

gshtras deleted the moe_final_v0.6.0_Nov19_fix_dynamic_quant branch December 20, 2024 15:23
