
Allow linear to take a >2D weight and a >1D bias. #3073

Merged: 7 commits merged into main on Oct 4, 2024

Conversation

@wujingyue (Collaborator) commented on Oct 1, 2024

As long as the extra dimensions are DID-parallel.

This allows a distributed transformer layer to use linear (instead of matmul+add) for speed and will simplify the pending #3045.

To avoid ambiguity, this PR also removes support for a 1D weight and a 0D bias; otherwise, it's unclear whether a 2D weight means one device dimension plus one non-device dimension or two non-device dimensions. This support can be added back by changing the thunder-to-nvFuser bridge to convert a 1D/0D linear into an unsqueeze, followed by a 2D/1D linear, followed by a squeeze.
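The unsqueeze/squeeze rewrite described above can be sketched in NumPy. This is a hypothetical illustration of the decomposition, not nvFuser or Thunder API; `linear_1d_weight` is an invented helper name.

```python
import numpy as np

def linear_1d_weight(x, weight_1d, bias_0d=None):
    # Hypothetical bridge-side rewrite: lift the 1D weight (and 0D bias)
    # to the 2D/1D form the fused linear accepts, then squeeze the
    # resulting size-1 output dimension away.
    weight_2d = weight_1d[np.newaxis, :]   # [in] -> [1, in]
    out = x @ weight_2d.T                  # [..., in] -> [..., 1]
    if bias_0d is not None:
        out = out + bias_0d                # 0D bias broadcasts over [..., 1]
    return np.squeeze(out, axis=-1)        # [..., 1] -> [...]

x = np.arange(12, dtype=float).reshape(4, 3)
w = np.array([1.0, 2.0, 3.0])
assert np.allclose(linear_1d_weight(x, w, 0.5), x @ w + 0.5)
```

The same trick applies on the input side in PyTorch-style linears; the point is only that dropping 1D/0D support in the fused op loses no expressiveness, since the bridge can always normalize ranks around the call.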

@wujingyue: !build

Review thread on csrc/ops/composite.cpp (outdated, resolved)
@wujingyue force-pushed the wjy/prefer branch 2 times, most recently from aa7b681 to ad951e0 on October 2, 2024 20:13
Base automatically changed from wjy/prefer to main October 2, 2024 21:29
@wujingyue: !build

@wujingyue: !build

@wujingyue force-pushed the wjy/three branch 3 times, most recently from ee5b81d to 96ed5e8 on October 3, 2024 01:44
@wujingyue: !build

@wujingyue force-pushed the wjy/three branch 2 times, most recently from 3d060ab to c61cdc1 on October 3, 2024 04:56
@wujingyue marked this pull request as ready for review on October 3, 2024 05:04
@wujingyue requested review from Priya2698 and cowanmeg and removed the request for cowanmeg on October 3, 2024 05:04
@wujingyue: !build

@wujingyue: !build

@cowanmeg (Collaborator) left a comment:

Overall LGTM, just some little things about adding comments

Review thread on tests/python/opinfo_input_generators.py (resolved)
Review thread on csrc/ir/nodes.cpp (resolved)
@wujingyue added the enhancement (New feature or request) label on Oct 4, 2024
@wujingyue: !build

@wujingyue: !build

@wujingyue merged commit 9d39b6c into main on Oct 4, 2024; 10 of 11 checks passed.
@wujingyue deleted the wjy/three branch on October 4, 2024 23:05
wujingyue added a commit that referenced this pull request Oct 5, 2024
This PR fixes a bug introduced in #3073. This bug causes
`nvFuser.Tensor` to have a different rank than the corresponding
`TensorView`. This didn't trigger any test failure until I wrote a more
complicated test that `slice`s the output of a linear.

Question for @rdspring1 and/or @kevinstephano: shouldn't this bug have been
caught earlier? I guess when the Python frontend finalizes the definition,
it should check that the output `nvFuser.Tensor`s are consistent with the
output `TensorView`s. Wdyt?
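A definition-time consistency check of the kind proposed above might look like the following sketch. The helper name and the rank-only comparison are assumptions for illustration; the real frontend classes and the full notion of "consistent" differ.

```python
def check_output_ranks(frontend_ranks, tv_ranks):
    """Hypothetical finalize-time check: each frontend output tensor's
    rank must match the rank of the TensorView backing it, so a
    rank-mismatch bug fails fast instead of surfacing later (e.g. when
    slicing the output of a linear)."""
    for i, (t_ndim, tv_ndim) in enumerate(zip(frontend_ranks, tv_ranks)):
        if t_ndim != tv_ndim:
            raise ValueError(
                f"output {i}: frontend rank {t_ndim} != TensorView rank {tv_ndim}")

# Matching ranks pass silently; a mismatch raises at definition time.
check_output_ranks([2, 3], [2, 3])
```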
wujingyue added a commit that referenced this pull request Oct 7, 2024
Similar to #3073, `sdpfa_fwd` shouldn't assume DIDs are available at
definition time. Instead, treat extra preceding dimensions as batch
dimensions at definition time and check that they are device-parallel at
evaluation time.

This is required to land #3115.
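The definition-time vs evaluation-time split described in this commit can be sketched as follows. The function names and the exact shape semantics are assumptions for illustration, not the actual nvFuser rules.

```python
def output_rank_at_definition(x_ndim, w_ndim):
    # Definition time: DIDs are not yet known, so any weight dims beyond
    # the core [out_features, in_features] pair are simply treated as
    # extra batch dims contributing to the output rank.
    extra = w_ndim - 2
    assert extra >= 0, "weight must be at least 2D"
    return x_ndim + extra

def check_extra_dims_at_evaluation(w_shape, mesh_size):
    # Evaluation time: the extra leading dims must actually be
    # device-parallel, i.e. match the device-mesh size; otherwise reject.
    extra = w_shape[:-2]
    if any(d != mesh_size for d in extra):
        raise ValueError("extra weight dims must be DID-parallel")

# A [D, out, in] weight with a 2D input yields a rank-3 output at
# definition time; at evaluation time D must equal the mesh size.
assert output_rank_at_definition(2, 3) == 3
check_extra_dims_at_evaluation((2, 8, 4), mesh_size=2)
```

Deferring the DID check this way keeps the op definition purely shape-based, which is what lets the same definition serve both single-device and distributed runs.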