[Bugfix] Fix Phi-3 BNB quantization with tensor parallel #9948
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
@mgoin This PR is ready for review. Can you take a look at this? Thanks!
Signed-off-by: Isotr0py <[email protected]>
I have removed all the static variables added before, and BNB with TP on Phi-3 still works well. Tested with TP=2.
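For readers who want to reproduce such a check, here is a minimal sketch using vLLM's offline `LLM` API; the model ID, prompt, and sampling settings below are illustrative assumptions, not taken from this PR:

```python
from vllm import LLM, SamplingParams

# Sketch of a TP=2 smoke test for a bitsandbytes-quantized Phi-3 model.
# Model name and prompt are placeholders, not the exact ones used in the PR.
llm = LLM(
    model="microsoft/Phi-3-mini-4k-instruct",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    tensor_parallel_size=2,
)
outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```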
```python
# Removed: hard-coded assumption that the fused weight always has two equal shards.
shard_size = loaded_weight.shape[output_dim] // 2
shard_offset = shard_size * shard_id

# Added: derive each shard's (offset, size) from self.output_sizes, then let
# adjust_bitsandbytes_4bit_shard map them onto the packed 4-bit parameter.
index = list(itertools.accumulate([0] + self.output_sizes))
orig_offsets = {
    str(i): (index[i], size)
    for i, size in enumerate(self.output_sizes)
}
orig_offsets["total"] = (self.output_size, 0)
shard_size, shard_offset = adjust_bitsandbytes_4bit_shard(
    param, orig_offsets, str(shard_id))
```
Although we use MergedColumnParallelLinear for gate_up_proj in most cases, I think we should not simply assume that the weight is always sharded into two subsets.
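To illustrate the point above, here is a small standalone sketch (not the vLLM code itself; the function name and sizes are made up) of deriving per-shard offsets from an arbitrary `output_sizes` list rather than assuming two equal halves:

```python
import itertools

def shard_offset_and_size(output_sizes, shard_id):
    # Cumulative start offsets of each sub-weight inside the fused tensor,
    # e.g. output_sizes=[8192, 8192] -> starts=[0, 8192, 16384].
    starts = list(itertools.accumulate([0] + output_sizes))
    return starts[shard_id], output_sizes[shard_id]

print(shard_offset_and_size([8192, 8192], 1))        # (8192, 8192) - gate_up case
print(shard_offset_and_size([1024, 4096, 4096], 2))  # (5120, 4096) - more than 2 sub-weights
```

The diff above builds this kind of offset table and then hands it to `adjust_bitsandbytes_4bit_shard`, which (as the name suggests) adjusts the offsets for the 4-bit-packed parameter.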
@mgoin Can you please take a look at this PR? I have removed all the model-level additions and only kept the logic for handling on-disk fused BNB weights with tensor parallel.
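For context, a hypothetical illustration of what "on-disk fused weights with tensor parallel" means in practice: the checkpoint stores gate_up_proj / qkv_proj as single fused tensors, and each TP rank must cut its slice out of every sub-weight. This sketch ignores the 4-bit packing that the PR's quant-state bookkeeping actually deals with, and all names and shapes are made up.

```python
import torch

# Hypothetical helper: slice one sub-weight of an already-fused checkpoint
# tensor for the current tensor-parallel rank (unquantized analogy).
def tp_slice_of_fused(fused: torch.Tensor, shard_offset: int, shard_size: int,
                      tp_rank: int, tp_size: int, dim: int = 0) -> torch.Tensor:
    sub = fused.narrow(dim, shard_offset, shard_size)    # e.g. the "up" part
    per_rank = shard_size // tp_size                     # even split across ranks
    return sub.narrow(dim, tp_rank * per_rank, per_rank)

# A fused gate_up_proj of shape [2 * 8192, 3072], rank 1 of 2, "up" half:
fused = torch.empty(2 * 8192, 3072)
print(tp_slice_of_fused(fused, shard_offset=8192, shard_size=8192,
                        tp_rank=1, tp_size=2).shape)     # torch.Size([4096, 3072])
```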
Much nicer state and good comments, thanks so much for iterating on this!
FIX #9937