add a quantized arg for vllm to run awq models #77

Merged
merged 4 commits into main from rishabh/vllm-awq on Feb 5, 2024

Conversation

@rishsriv (Member) commented on Feb 3, 2024

Used this to test our private AWQ model. It works reasonably well! AWQ seems much faster when serving single requests through an API, though it is slower than the non-AWQ model for batched processing. It also consumes just a quarter of the memory.

Side note: accuracy was reasonable, about 3% lower than the non-AWQ version at the corresponding num_beams (tested with num_beams of 4, 2, and 1).
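For context, a minimal sketch of what a quantized arg typically wires up to in vLLM: passing quantization="awq" when constructing the LLM so it loads AWQ-quantized weights. The model path and prompt below are placeholders, not the private model or eval prompts referenced above, and the exact flag name in this PR may differ.

```python
from vllm import LLM, SamplingParams

# Hypothetical illustration: load an AWQ-quantized model by passing
# quantization="awq" to vLLM. The model path is a placeholder, not the
# private model referenced in this PR.
llm = LLM(model="path/to/awq-quantized-model", quantization="awq")

# Greedy decoding for simplicity; the beam-search settings mentioned in
# the comment above are not reproduced here.
params = SamplingParams(temperature=0, max_tokens=256)

outputs = llm.generate(["Write a SQL query that counts users by country."], params)
print(outputs[0].outputs[0].text)
```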

@rishsriv marked this pull request as ready for review on February 3, 2024 12:52
@wongjingping merged commit 57990fc into main on Feb 5, 2024
2 checks passed
@wongjingping deleted the rishabh/vllm-awq branch on February 5, 2024 01:18