This feature request is to create a drop-in replacement for https://pytorch.org/docs/stable/generated/torch.nn.functional.linear.html that's sharded.
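For reference, the unsharded op this would replace behaves as below; the shapes are just illustrative, not taken from this issue:

```python
import torch
import torch.nn.functional as F

# torch.nn.functional.linear computes out = x @ w.T + b
x = torch.randn(16, 1024)    # [batch, in_features]
w = torch.randn(4096, 1024)  # [out_features, in_features] -- 2D weight
b = torch.randn(4096)        # [out_features]
out = F.linear(x, w, b)      # [batch, out_features]
```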
A linear layer can be sharded in several ways. For example, with data parallelism, expect `b` to be DIDy-parallel in addition to the hidden dimension.

Due to #2563, we have to manually split the device dimension in the logical domain instead of having it as logical-to-loop transforms. This has prevented us from having a drop-in replacement: for example, the weight has to be 3D, whereas `torch.linear` takes a 2D weight.
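A minimal sketch of that mismatch, with a made-up device count and shapes: once the device dimension is split out in the logical domain, the weight being carried around is 3D, so a plain `torch.nn.functional.linear` call no longer accepts it and each device has to slice out its local 2D shard.

```python
import torch
import torch.nn.functional as F

d = 4                                    # hypothetical device count
x = torch.randn(16, 1024)                # [batch, in_features]
w_3d = torch.randn(d, 4096 // d, 1024)   # device dim split out explicitly -> 3D weight

# F.linear(x, w_3d)                      # not a drop-in call: linear expects a 2D weight

rank = 0                                 # stand-in for this device's index
w_local = w_3d[rank]                     # [out_features / d, in_features] -- 2D again
out_local = F.linear(x, w_local)         # [batch, out_features / d]
```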
#3073 is an attempt to support case 1 under the limitation of #2563. I did this first because having case 1 fused turns out to be important for performance.
The other cases are yet to be done, and should probably wait until after #2563 to avoid accumulating too much tech debt. Note that for case 2 in particular, we'll also need to decompose a sharded linear into matmul + collective + biasadd.
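A rough sketch of that decomposition for case 2, assuming the weight is sharded along the reduction (`in_features`) dimension so each rank produces a partial product and the collective is an all-reduce; the sharding choice, shapes, and process-group setup here are assumptions, not details from this issue:

```python
import torch
import torch.distributed as dist

def sharded_linear_case2(x_local: torch.Tensor,
                         w_local: torch.Tensor,
                         bias: torch.Tensor) -> torch.Tensor:
    """matmul + collective + biasadd (assumed row-parallel layout).

    x_local: [batch, in_features / world_size]        local input shard
    w_local: [out_features, in_features / world_size] local weight shard
    bias:    [out_features]                           replicated on every rank
    """
    # 1. matmul on the local shards -> partial sums over the full out_features
    partial = x_local @ w_local.t()                 # [batch, out_features]
    # 2. collective -- assumed here to be an all-reduce summing the partial products
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    # 3. biasadd after the reduction so the bias is added exactly once
    return partial + bias
```

Ordering the biasadd after the collective matters: adding the bias to each partial product before the all-reduce would accumulate it world-size times.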