Does aphrodite-engine support models quantized with vllm's llm-compressor? #563
-
For example, could I use it to quantize a model with SmoothQuant to do 8-bit lossless inference later?
Replies: 2 comments 1 reply
-
Not at the moment, no. But I'm currently backporting all the missing vllm features in the rc_054 branch. I'm about 2 months behind, so it'll take a week or so. In the meantime, I've added support for 4, 6, 8, and 12 bits from deepspeedfp in that branch. A later update will add support for 5, 6, 7, and a better 8-bit, with good performance at higher batch sizes. You can use deepspeedfp by specifying

```
-q deepspeedfp --num-deepspeedfp-bits {4,6,8,12}
```

when launching a 16-bit model. Disclaimer: it's rather slow.
-
There's some support for llm-compressor models in the
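Putting the flags from the reply above together, a full launch might look like the sketch below. The entry point and the model name are assumptions for illustration (aphrodite-engine exposes a vllm-style OpenAI-compatible server); only the `-q deepspeedfp --num-deepspeedfp-bits` flags come from the reply itself.

```shell
# Hypothetical launch command -- entry point and model are assumptions,
# not confirmed by the thread. Serves a 16-bit checkpoint with
# deepspeedfp quantization at 8 bits (valid choices: 4, 6, 8, 12).
python -m aphrodite.endpoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  -q deepspeedfp \
  --num-deepspeedfp-bits 8
```

Note that per the disclaimer in the reply, deepspeedfp inference is rather slow in this branch.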