

Allow sharding fine tuned misaligned models #126

Merged
2 commits merged into main from fine-tuned-misaligned on Dec 9, 2024

Conversation

tengomucho (Collaborator)

What does this PR do?

When loading large models, weights are sharded across a mesh of TPUs,
splitting the original weights into smaller tensors, each with the
same shape.
This is not possible, however, if the original weight's shape is not
divisible by the number of TPUs, because it results in a smaller
tensor for the last TPU.
This change pads the tensor with zeros, making it splittable across the
TPUs.
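
For illustration only, here is a minimal PyTorch sketch of the padding idea; it is not the code in this PR, and the function name `pad_for_sharding` and the shard count are made up for the example:

```python
# Illustrative sketch: zero-pad a weight along its sharded axis so its size
# becomes divisible by the number of TPU shards. Not the exact PR code.
import torch
import torch.nn.functional as F

def pad_for_sharding(weight: torch.Tensor, num_shards: int, axis: int = 0) -> torch.Tensor:
    """Return `weight` zero-padded along `axis` so it splits evenly across `num_shards`."""
    remainder = weight.shape[axis] % num_shards
    if remainder == 0:
        return weight
    pad_len = num_shards - remainder
    # F.pad lists padding from the last dimension backwards: (left, right) per dim.
    pad = [0, 0] * weight.dim()
    pad[2 * (weight.dim() - 1 - axis) + 1] = pad_len
    return F.pad(weight, pad)

# Example: 128,258 rows do not split evenly across 8 shards, but 128,264 rows do.
w = torch.randn(128_258, 16)
assert pad_for_sharding(w, num_shards=8).shape[0] % 8 == 0
```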

I expected this to cause a change in the model, but apparently that is not the case: the model still has the same shape and produces correct output. I suspect this is due to the way the weight is loaded on the mesh, and the fact that the model figures out which part it should use. However, I haven't had the chance to try many "misaligned" models; the only one I used here works now.

Fixes #67

Before submitting

  • Did you write any new necessary tests?

tengomucho marked this pull request as ready for review December 6, 2024 15:34
baptistecolle (Collaborator) left a comment

LGTM, but did you also try RLHFlow/ArmoRM-Llama3-8B-v0.1? Should we add it to the tests? They mention it in the issue, so it would be nice to check that it indeed works.

tengomucho (Collaborator, Author)

@baptistecolle I could try it, but RLHFlow/ArmoRM-Llama3-8B-v0.1 is a LlamaForRewardModelWithGating model (see its config.json), and we only support AutoModelForCausalLM so far. So I think this will raise issues unrelated to the fact that this model is fine-tuned.
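
As a quick way to confirm the declared architecture (a sketch, not part of this PR; it only assumes `huggingface_hub` is installed):

```python
# Sketch: read a checkpoint's config.json to see which head it declares,
# before trying to load it with AutoModelForCausalLM.
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download("RLHFlow/ArmoRM-Llama3-8B-v0.1", "config.json")
with open(config_path) as f:
    print(json.load(f).get("architectures"))  # expected: ["LlamaForRewardModelWithGating"]
```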

tengomucho merged commit e2c5ac2 into main Dec 9, 2024
3 checks passed
tengomucho deleted the fine-tuned-misaligned branch December 9, 2024 10:22
baptistecolle pushed a commit that referenced this pull request Dec 10, 2024
* fix(Makefile): re-add style target

* feat(jetstream): pad weights to support unaligned sharding

Successfully merging this pull request may close these issues:
Issues running finetuned versions of supported models (#67)