Tensor parallel MLP #2360
Conversation
…ointwise_scheduler
This is not a full reversion of NVIDIA#2419, which also renamed `INPUT_C` and `OUTPUT_D`, and made some modifications to `map{Linear,Matmul}OpIterDomains`. This preserves those changes but allows us to keep distinguishing A and B operands. Fixes NVIDIA#2434
Some followup items not addressed by this PR:
(1) Line 102 in beb2287 (embedded code reference).
(2) Symbolic TVs: this error is encountered with symbolic TVs:
C++ exception with description "ext_opt.hasValue() INTERNAL ASSERT FAILED at csrc/dynamic_transform.cpp:276, Could not evaluate dynamic extent: i3. Exception raised from DynamicTransformConcretizationInfo at csrc/dynamic_transform.cpp:276"
(3) Improved sharding propagation: the TVs where sharding propagation breaks down are (a) broadcasts where a device-dim axis is broadcast, and (b) the rand_like operator, because it creates a fresh new TV. The current pass assumes that one of the inputs is already sharded and propagates from producer to consumer. To support these cases we need to propagate up from the consumer.
Almost LGTM. Nice work!
int64_t h = 128;
int64_t h4 = 4 * h;
// TODO: error with dynamic shape
Can you clarify this? Are you saying the following code would fail if changed to makeContigTensor?
Correct. This is follow-up item (2).
I think it's another instance of #2462. Please revisit when it's fixed.
We are using FusionExecutorCache, so sadly that did not fix the error. I did narrow down what is causing the error and opened #2481.
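For context, a minimal sketch of the contrast behind follow-up item (2), assuming the test's usual shape helpers; the names, dtypes, and sizes below are illustrative, not the PR's literal code:

Fusion fusion;
FusionGuard fg(&fusion);

const int64_t sb = 64, h = 128;  // placeholder sizes

// Concrete extents: the form the test uses today.
TensorView* x_concrete =
    makeContigConcreteTensor({sb, h}, DataType::BFloat16);
fusion.addInput(x_concrete);

// Symbolic extents: switching the inputs to this form currently trips the
// "Could not evaluate dynamic extent" assert during concretization, which
// is what #2481 tracks.
TensorView* x_symbolic = makeContigTensor(2, DataType::BFloat16);
fusion.addInput(x_symbolic);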
TensorView* gelu_ = castOp(DataType::BFloat16, gelu);

// Linear #2
gelu_ = segment_set(gelu_);
Can you comment on why this is needed?
Without the segment set, the reduction scheduler gets called instead of the matmul scheduler. Can add a comment in the code as well.
btw, IIUC we should use the matmul op for both nvFuser matmul and ATen matmul in the future, and that will address the segmentation issue we see here. We will just need to update our resharding passes to handle matmul and linear ops appropriately.
Can add a comment in the code as well.
That'd always be helpful. Thank you!
update our resharding passes to handle matmul and linear ops appropriately
I'm surprised using matmul/linear changes resharding at all because they are done locally. What do you mean?
Ahh, I should clarify: it would be in the insertResharding pass, which automatically adds set operations where necessary. The workaround is very simple; we just add the set manually.
SG. I'm fine leaving the segment_set as is. However, in the next (or next next :) PR, we should improve insertResharding to avoid the segment_set. Neither Thunder nor nvFuser's Python API has segment_set, and it's too low-level for the framework to add properly anyway. How does that sound?
If we use matmul instead of the broadcast+mul+sum, no segment_set is needed, so in the next PR we will have a version with no segment_set :)
TensorView* linear_int2 = mul(linear_int0, linear_int1);
TensorView* linear_int3 = sum(linear_int2, {-1});
Use matmul or linear? cc @Priya2698
The first linear layer can be replaced with a linear op because there is no communication between the bias add and the local matmul. The second linear layer will have to be a matmul, since the pattern is matmul + allreduce + bias add.
I do have tests replacing the second linear layer with matmul, but it was generating very high error. My assumption is that testValidate isn't meant to be used when chaining multiple matmuls together? I also fiddled around with the datatypes of the matmul (i.e. bfloat, float, double), but the error was still high.
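For reference, a rough sketch of the two formulations being compared; the tensor names, shapes, and the availability of a bias-free linear overload are assumptions, not the PR's exact code:

Fusion fusion;
FusionGuard fg(&fusion);

const int64_t M = 64, K = 128, N = 512;  // placeholder sizes
TensorView* x = makeContigConcreteTensor({M, K});  // activations
TensorView* w = makeContigConcreteTensor({N, K});  // linear-style weight
fusion.addInput(x);
fusion.addInput(w);

// Broadcast + mul + sum, the formulation used in the diff above:
TensorView* x_b = broadcast(x, {false, true, false});  // [M, 1, K]
TensorView* w_b = broadcast(w, {true, false, false});  // [1, N, K]
TensorView* prod = mul(x_b, w_b);                      // [M, N, K]
TensorView* out_mul_sum = sum(prod, {-1});             // reduce K -> [M, N]

// Single-op alternative being suggested (matmul would instead want the
// weight laid out as [K, N]):
TensorView* out_linear = linear(x, w);                 // [M, N]

fusion.addOutput(out_mul_sum);
fusion.addOutput(out_linear);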
matmul + allreduce + bias add
Is it crazy to do matmul + add half bias + all_reduce? This may be slightly faster, but we can explore that later.
it was generating very high error
Does the second linear layer, in the PR as is, map to pointwise+reduction or ATen matmul? If the former, we'll need to fix it soon because pointwise+reduction is likely slower. If the latter, I'll be curious why ATen matmul via mul+sum has lower error than ATen matmul via matmul.
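For what it's worth, the half-bias idea is algebraically safe: the allreduce is a sum over the D shards, so adding b/D on each shard before the reduce is the same as adding b once afterwards:

$$\sum_{d=1}^{D}\left(x_d W_d + \tfrac{b}{D}\right) = \left(\sum_{d=1}^{D} x_d W_d\right) + b$$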
matmul + add half bias + all_reduce
That's not a bad idea! When we start benchmarking I will try this.
Currently, the second linear layer uses nvfuser matmul. I have an uncommitted version of the MLP test that uses ATen matmul instead; it generated a max error of ~3, while testValidate had a threshold of ~1e-5. It's just a guess, but I don't think I am setting the datatypes correctly in the ATen baseline.
matmul + allreduce + bias add
Also, mainstream models (e.g. Llama) don't seem to use bias, so this might not be relevant.
Currently, the second linear layer uses nvfuser matmul.
SG. In the next PR, let's try to use ATen matmul. This is how the integration will look and it gives the best performance at the moment. Even if the error is slightly higher than the threshold, I'd accept it because (1) we know the error is localized to matmul, and (2) checking in the code allows us to investigate in parallel.
Co-authored-by: Jingyue Wu <[email protected]>
!build
@wujingyue - I added the MLP test with ATen matmul. Note that the tolerance is bumped up a bit to pass validation. Validation error in output 0 (linear1) on line 583 in file tests/cpp/test_multidevice_matmul.cpp. Validation error in output 2 (linear2) on line 583 in file tests/cpp/test_multidevice_matmul.cpp.
!build |
// Linear #1
TensorView* matmul1;
if (use_aten_matmul) {
  // TODO: use linear op instead
cc @Priya2698
@cowanmeg reminded me of a practical limitation: currently, we split rfactor for DID, so w0 is 3D. linear (just like torch.linear) doesn't take a 3D weight. This limitation will eventually go away when we split leaf instead of rfactor, but it will likely exist for the rest of the year.
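A small sketch of that limitation, with an assumed device count and sizes (illustrative only, not the test's literal code):

Fusion fusion;
FusionGuard fg(&fusion);

const int64_t D = 2, h = 128, h4 = 4 * h;  // placeholder sizes

// With DID expressed as a split of the rfactor domain, the weight carries an
// explicit device dimension and is therefore 3D: [D, h4/D, h].
TensorView* w0 = makeContigConcreteTensor({D, h4 / D, h}, DataType::BFloat16);
fusion.addInput(w0);
w0->axis(0)->parallelize(ParallelType::DIDx);  // device dimension

// linear expects a 2D [out_features, in_features] weight (just like
// torch.nn.functional.linear), so the 3D sharded w0 cannot be passed to it
// until DID can be expressed as a leaf-domain split instead.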
Manually sharded tensor parallel multilayer perceptron (MLP) layer.
The input is a manually translated and sharded MLP layer taken from nanoGPT. See #2199 for where we get the initial compute trace.