[Model] Add support for OLMo architecture #3046
Conversation
@Lanssi Thanks for your contribution! I’ll take a look at your code once it passes CI.
Yes! Thanks!
@tlopex Here is some supplementary information.
I also ran some tests, and this is the result:
It seems to be running well.
Maybe we can further test the 7B variant (link: https://huggingface.co/allenai/OLMo-7B-0724-hf), but for the moment I can't meet the hardware requirements to test it.
@Lanssi Thank you so much for the additional supplement and testing! First, about your fourth point, the implementation of … Second, could you please tell me which way you chose to quantize the model? Also, I can help test the 7B variant if needed.
Sure. I tested q0f16 only on Android, and q4f16_1 and q4f32_1 on both Android and CUDA. And thanks for your help!
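For readers unfamiliar with MLC-LLM's quantization codes: q0f16 roughly means no weight quantization with float16 compute, while q4f16_1 and q4f32_1 use 4-bit weight quantization with float16 or float32 compute. As a rough illustration only (not MLC-LLM's actual kernels; the function names here are made up for the sketch), a symmetric per-group 4-bit scheme works like this:

```python
# Illustrative sketch of a symmetric per-group 4-bit quantization scheme,
# in the spirit of "q4"-style modes. Not MLC-LLM's real implementation.

def quantize_group(weights, bits=4):
    """Quantize one group of weights to signed `bits`-bit integers plus a scale."""
    qmax = 2 ** (bits - 1) - 1          # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    """Recover approximate float weights from the quantized integers."""
    return [x * scale for x in q]

q, s = quantize_group([0.5, -1.4, 0.7, 0.0])
approx = dequantize_group(q, s)         # approximate reconstruction of the group
```

Each group stores only small integers plus one float scale, which is where the memory savings come from; "q0" modes skip this step and keep the original float16 weights.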
Overall it looks great to me! I tested it on my CUDA device with q4f16_1 and it worked well. |
@tlopex Thanks for reviewing my code. I will check it. |
Thank you @Lanssi for contributing! We can remove some quantizations in follow-up PRs if they are not supported.
This PR adds support for the OLMo architecture.
Additional support: clip_qkv.
Tests: already tested on Android (Pixel 4) and CUDA (with tensor_parallel_shards=2).
Test models: amd/AMD-OLMo-1B (without clip_qkv) and allenai/OLMo-1B-0724-hf (with clip_qkv).
However, the generation quality of the latter is not as good as expected, even though I've tried different implementations of the clip_qkv mechanism, e.g. te.compute and nn.maximum/nn.minimum.
Finally, I checked the docs, and the following is the most simplified:
But the result still isn't good enough.
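For context, OLMo's clip_qkv option clamps the fused QKV projection outputs to the range [-clip_qkv, +clip_qkv] before they are split into query/key/value heads. A minimal plain-Python sketch of that clamping (illustrative only, not the PR's actual nn.maximum/nn.minimum code):

```python
# Sketch of clip_qkv: clamp every QKV activation into [-clip_val, +clip_val],
# mirroring the pattern nn.minimum(nn.maximum(x, -clip_val), clip_val).

def clip_qkv(qkv, clip_val):
    """Clamp each element of the projected QKV activations."""
    return [max(-clip_val, min(clip_val, x)) for x in qkv]

print(clip_qkv([-10.0, -2.0, 0.5, 12.0], 8.0))  # -> [-8.0, -2.0, 0.5, 8.0]
```

Since clamping is elementwise and order-independent, the te.compute and nn.maximum/nn.minimum formulations should be numerically equivalent, which suggests the quality gap comes from elsewhere (e.g. quantization error) rather than from which clamping implementation is used.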
This is the output from the CLI:
And this is the output from Android (Pixel 4):
Please note that this is my first PR. If I missed something, please point it out. Thanks!