Allow sharding fine tuned misaligned models #126
Merged
What does this PR do?
When loading large models, weights are sharded across a mesh of TPUs,
splitting the original weights into smaller tensors that all have the
same shape.
This is not possible, however, when the original weight's shape is not
divisible by the number of TPUs, because the split would leave a smaller
tensor on the last TPU.
This change pads the tensor with zeros, making it splittable across the
TPUs.
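As a rough sketch of the idea (not the actual code in this PR; the function name `pad_for_sharding` and the shapes are illustrative), the padding amounts to extending the sharded axis with zeros until its size divides evenly across the devices:

```python
import numpy as np

def pad_for_sharding(weight: np.ndarray, num_devices: int, axis: int = 0) -> np.ndarray:
    """Zero-pad `weight` along `axis` so its size is divisible by `num_devices`."""
    size = weight.shape[axis]
    remainder = size % num_devices
    if remainder == 0:
        return weight  # already splittable, nothing to do
    pad_amount = num_devices - remainder
    pad_widths = [(0, 0)] * weight.ndim
    pad_widths[axis] = (0, pad_amount)  # pad only at the end of the sharded axis
    return np.pad(weight, pad_widths, mode="constant", constant_values=0)

# Example: a dimension of 50257 is not divisible by 8 TPUs,
# so it gets padded to 50264 before being placed on the mesh.
w = np.ones((50257, 768), dtype=np.float32)
w_padded = pad_for_sharding(w, num_devices=8, axis=0)
assert w_padded.shape[0] % 8 == 0
```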
I expected this to cause a change in the model, but apparently that is not the case: the model still has the same shape and produces correct output. I suspect this is due to the way the weight is loaded onto the mesh, with the model figuring out which part it should use, but I haven't had the chance to try many "misaligned" models;
the only one I used here works now.
Fixes #67
Before submitting