add a quantized arg for vllm to run awq models #77

rishsriv · 2024-02-03T12:47:14Z

Used this to test our private AWQ model. Works reasonably well! AWQ seems much faster when running as an API for single (non-batched) requests, though it is slower than non-AWQ for batched processing. It also consumes just 1/4th the memory.

Side note: accuracy was reasonable: -3% compared to the non-AWQ version at the corresponding num_beams (tested with num_beams of 4, 2, and 1).
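For reviewers unfamiliar with how a flag like this reaches vLLM, here is a minimal sketch of the idea, not the actual diff. It assumes the flag is a boolean `--quantized` CLI switch; `build_llm_kwargs`, `parse_args`, and `load_llm` are hypothetical helper names used only for illustration. The one piece taken from vLLM itself is the `quantization="awq"` keyword on `vllm.LLM`.

```python
import argparse


def build_llm_kwargs(model_name: str, quantized: bool, num_gpus: int = 1) -> dict:
    """Build keyword arguments for vllm.LLM (hypothetical helper, not from this PR)."""
    kwargs = {"model": model_name, "tensor_parallel_size": num_gpus}
    if quantized:
        # Ask vLLM to load the checkpoint with its AWQ quantization support.
        kwargs["quantization"] = "awq"
    return kwargs


def parse_args(argv=None) -> argparse.Namespace:
    """Parse a minimal CLI; the flag names here are assumptions."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--quantized", action="store_true")
    return parser.parse_args(argv)


def load_llm(args: argparse.Namespace):
    """Create the vLLM engine; requires vLLM and a supported GPU."""
    # Imported lazily so the helpers above can be used without vLLM installed.
    from vllm import LLM

    return LLM(**build_llm_kwargs(args.model, args.quantized))
```

Leaving the `quantization` key out entirely when the flag is off keeps the non-AWQ path on vLLM's defaults, so existing runs should behave as before.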

rishsriv added 4 commits February 3, 2024 20:47

add a quantized arg for vllm to run awq models

0dab4b5

Update vllm_runner.py

1a6af0b

Update README.md

952b37e

formatting

ec8d935

rishsriv marked this pull request as ready for review February 3, 2024 12:52

wongjingping approved these changes Feb 5, 2024

wongjingping merged commit 57990fc into main Feb 5, 2024
2 checks passed

wongjingping deleted the rishabh/vllm-awq branch February 5, 2024 01:18
