Shard MHA. #3115

wujingyue · 2024-10-05T04:49:49Z

As a follow-up to #3045.

Similar to #3073, `sdpfa_fwd` shouldn't assume DIDs are available at definition time. Instead, treat extra preceding dimensions as batch at definition time and check they are device parallel at evaluation time. This is required to land #3115.

For #2199. This PR only shards the MLP. MHA will come in a separate PR (#3115) to keep changes small and incremental.

wujingyue · 2024-10-09T17:38:29Z

!build

wujingyue · 2024-10-10T18:43:18Z

!build

samnordmann

LGTM

samnordmann · 2024-10-15T11:12:12Z

tests/python/test_multidevice.py

+                [0, 2, 3],
+            ),
+        )
+        T152_matmul = self.ops.sum(T152_local_matmul, [0])  # allreduce


I wonder what would happen currently if we do not decompose the matmul and the allreduce...

wujingyue · 2024-10-15

The first thing that'll break is that `linear` will produce the wrong shape. `linear`, as implemented today, outputs a tensor of rank input_rank + weight_rank - 2 = 5. However, we want the shape to be [d, b, s, e], which is 4D.
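To make the decomposition concrete, here is a minimal sketch in plain Python (not the nvFuser API) of a per-device local matmul followed by a sum over the device axis, which plays the role of the allreduce. The helper names `linear_output_rank` and `local_matmul`, the toy sizes, and the assumed ranks (input rank 4, weight rank 3) are illustrative assumptions, not taken from the PR.

```python
def linear_output_rank(input_rank: int, weight_rank: int) -> int:
    # Rank rule quoted in the reply: input_rank + weight_rank - 2.
    return input_rank + weight_rank - 2


def local_matmul(x_rows, w_rows):
    # x_rows: [rows][k], w_rows: [out][k] -> [rows][out]
    return [
        [sum(a * b for a, b in zip(x_row, w_row)) for w_row in w_rows]
        for x_row in x_rows
    ]


d = 2  # hypothetical number of devices
x = [[1, 2, 3, 4], [5, 6, 7, 8]]                 # [b*s, e], toy values
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 2, 0, 1]]   # [e_out, e], toy values
chunk = len(x[0]) // d

# Each "device" holds one slice of the contracted dimension.
local = []
for dev in range(d):
    lo, hi = dev * chunk, (dev + 1) * chunk
    x_shard = [row[lo:hi] for row in x]
    w_shard = [row[lo:hi] for row in w]
    local.append(local_matmul(x_shard, w_shard))  # shape [d][rows][out]

# Summing over the device axis stands in for the allreduce.
reduced = [
    [sum(local[dev][r][o] for dev in range(d)) for o in range(len(w))]
    for r in range(len(x))
]

reference = local_matmul(x, w)  # unsharded result
assert reduced == reference

# Assumed ranks: a rank-4 input and a rank-3 weight give rank 5.
assert linear_output_rank(4, 3) == 5
```

The sum over the leading axis is what removes the device dimension; a single fused `linear` that follows the rank rule above would leave an extra dimension instead.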

samnordmann · 2024-10-15T11:40:45Z

tests/python/test_multidevice.py

-        T131 = self.ops.permute(T130, dims=[0, 2, 1, 3])
-        T137 = self.ops.reshape(T117, new_shape=[b, s, h, e // h])
-        T138 = self.ops.permute(T137, dims=[0, 2, 1, 3])
+        T123 = self.ops.reshape(T104, new_shape=[d, b, s, h // d, e // h])


IIUC, we go from shape [d, b, s, e//d] to [d, b, s, h//d, e//h]. Nothing illegal about it, but it looks surprising to me, so I just want to make sure.

wujingyue · 2024-10-15

I double-checked and it looks right. MHA is head parallel according to Figure 3b in https://arxiv.org/pdf/1909.08053.
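As a quick sanity check on the reshape discussed in this thread, the sketch below confirms that [d, b, s, e//d] and [d, b, s, h//d, e//h] hold the same number of elements whenever d divides h and h divides e. The function name `mha_shard_shapes` and the concrete sizes are hypothetical and chosen only for illustration.

```python
from math import prod


def mha_shard_shapes(d: int, b: int, s: int, h: int, e: int):
    # Head parallelism needs heads to split evenly across devices
    # and the hidden size to split evenly across heads.
    assert h % d == 0, "number of heads must be divisible by device count"
    assert e % h == 0, "hidden size must be divisible by number of heads"
    before = [d, b, s, e // d]          # hidden dim sharded across devices
    after = [d, b, s, h // d, e // h]   # h//d local heads, each of size e//h
    return before, after


before, after = mha_shard_shapes(d=2, b=1, s=8, h=4, e=16)

# The reshape is legal because the element counts match:
# e//d == (h//d) * (e//h) under the divisibility assumptions above.
assert prod(before) == prod(after)
```

Each device ends up with h//d whole heads, which is consistent with the head-parallel layout the reply refers to.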

wujingyue · 2024-10-15T16:30:27Z

!build

This was referenced Oct 5, 2024

Partially sharded transformer layer forward using Python API. #3045

Merged

Fix sdpfa_fwd to not assume the presence of DIDs. #3116

Merged

wujingyue added a commit that referenced this pull request Oct 9, 2024

Partially sharded transformer layer forward using Python API. (#3045)

6eff42b

For #2199. This PR only shards the MLP. MHA will come in a separate PR (#3115) to keep changes small and incremental.

Base automatically changed from wjy/forward to main October 9, 2024 17:05

wujingyue force-pushed the wjy/mha branch 2 times, most recently from 28fadd2 to 0ec0749 Compare October 9, 2024 17:37

wujingyue marked this pull request as ready for review October 9, 2024 17:38

wujingyue requested review from cowanmeg and samnordmann October 9, 2024 17:38

wujingyue force-pushed the wjy/mha branch from 0ec0749 to 1ab167c Compare October 10, 2024 18:42

samnordmann approved these changes Oct 15, 2024


Shard MHA in the test.

92a0e3b

wujingyue force-pushed the wjy/mha branch from 1ab167c to 92a0e3b Compare October 15, 2024 16:17

Update batch size to 1 to be consistent with the TE baseline.

e734f81

wujingyue merged commit 66c4bed into main Oct 15, 2024
33 of 34 checks passed

wujingyue deleted the wjy/mha branch October 15, 2024 18:13

wujingyue added the enhancement New feature or request label Nov 28, 2024
