Fix reference implementations and improve validation. #2905
Conversation
!build
auto gelu = at::gelu(linear0, "tanh");
auto linear1 =
    at::matmul(gelu.to(at_dtype), w1).to(at::kFloat) + b1.to(at::kFloat);
auto linear0 = at::matmul(x, w0).to(at::kFloat) + b0;
A no-op change. The + operator in PyTorch already promotes data types.
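For illustration, a minimal standalone sketch (not from the PR; tensor names are hypothetical) of the promotion rule being relied on: adding a bfloat16 bias to a float32 matmul result already yields a float32 tensor, so an explicit cast of the bias to at::kFloat is redundant.

```cpp
#include <ATen/ATen.h>

int main() {
  // Stand-ins for the matmul result and the bf16 bias.
  auto matmul_f32 = at::randn({2, 3}, at::kFloat);
  auto bias_bf16 = at::randn({3}, at::kBFloat16);
  // Type promotion: float32 + bfloat16 -> float32, no explicit cast needed.
  auto sum = matmul_f32 + bias_bf16;
  TORCH_CHECK(sum.scalar_type() == at::kFloat);
  return 0;
}
```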
auto linear1 =
    at::matmul(gelu.to(at_dtype), w1).to(at::kFloat) + b1.to(at::kFloat);
auto linear0 = at::matmul(x, w0).to(at::kFloat) + b0;
auto gelu = at::gelu(linear0, "tanh").to(at_dtype);
This is necessary for at::allclose to pass, because mlp generates its output in at_dtype.
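As a rough illustration (function and tensor names are hypothetical, not the PR's actual test code): the reference is rounded to the same dtype the fusion produces before comparing, since otherwise the full-precision float reference differs from the bfloat16 output by the rounding error alone.

```cpp
#include <ATen/ATen.h>

// Sketch: the fusion's output is produced in at_dtype (e.g. bfloat16), so the
// float reference is cast to at_dtype before the comparison.
bool matches_reference(const at::Tensor& fusion_output,
                       const at::Tensor& reference_float,
                       at::ScalarType at_dtype,
                       double rtol,
                       double atol) {
  return at::allclose(fusion_output, reference_float.to(at_dtype), rtol, atol);
}
```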
auto outputs = runtime.runWithInput(inputs);
validate(expected_outputs, outputs);
auto outputs = fec.runFusionWithInputs(inputs);
ASSERT_EQ(outputs.size(), expected_outputs.size());
Why not just use the existing validate and change the tolerance?
I'll try to do that later. This PR only fixes the ref of MLP, and validate is used by other tests as well.
You could parameterize it with the lower atol. I imagine if we want tighter bounds for all the tests, each one will have its own value that is hard to predict without manually trying.
I suspected some inaccuracy and made these changes. Although they turned out not to affect accuracy, I left them in the PR as general cleanups.
!build
<< "Output " << i << " has a mismatching data type."; | ||
|
||
// Note: Scaling tolerance up since the error accumulates across ops | ||
// BFloat16 error is quite high, but the program has been verified with | ||
// double precision to be logically correct. | ||
const double atol = 0.075 * (i + 1); | ||
const double rtol = 1.6e-2; |
Why did you remove rtol? Some of the absolute errors are 1e-5, which is a very small margin of error for bfloat16.
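For context, here is a sketch (not the test code itself) of the element-wise criterion at::allclose applies. With rtol removed, the bound is purely absolute; an atol around 1e-5 is far below bfloat16's roughly 2^-8 relative precision, so values that agree to bf16 precision can still fail once their magnitude is non-trivial.

```cpp
#include <ATen/ATen.h>

// Element-wise check in the spirit of at::allclose / torch.isclose:
//   |actual - expected| <= atol + rtol * |expected|
// With rtol = 0, the tolerance does not scale with the magnitude of the
// expected value.
bool roughly_close(const at::Tensor& actual,
                   const at::Tensor& expected,
                   double rtol,
                   double atol) {
  auto tolerance = atol + rtol * expected.abs();
  return ((actual - expected).abs() <= tolerance).all().item<bool>();
}
```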
It would also be beneficial to keep some form of default value; since tests are still being added, this adds a lot of effort to get a simple example working.
I removed it because I found I ended up always setting rtol to 0 :)
I'd love to keep some default value to, as you said, minimize the effort to get a simple example working. What would be a good default value? The old atol=0.075*(i+1) is too relaxed after I fixed the reference implementation. The *(i+1) part is also problematic because the output number doesn't necessarily match the layer number.
How about this (see the sketch below):
- Hardcode rtol to 0.016. I'll try to finetune this value a little bit -- 0.016 sounds like a large default rtol to start with.
- Still require the caller to provide a list of per-output atols, because it's hard to find a good default and each test seems to require something different.
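A hypothetical sketch of that interface (names and values are placeholders, not the actual change): one hardcoded default rtol shared by all outputs, while the caller supplies one atol per output.

```cpp
#include <vector>

#include <ATen/ATen.h>
#include <gtest/gtest.h>

void validate(
    const std::vector<at::Tensor>& expected_outputs,
    const std::vector<at::Tensor>& outputs,
    const std::vector<double>& atols) {
  // Default rtol shared by all outputs; 1.6e-2 is a placeholder pending tuning.
  constexpr double kDefaultRtol = 1.6e-2;
  ASSERT_EQ(outputs.size(), expected_outputs.size());
  ASSERT_EQ(outputs.size(), atols.size());
  for (size_t i = 0; i < outputs.size(); i++) {
    // Compare in double so the tolerance check itself adds no rounding error.
    EXPECT_TRUE(at::allclose(
        outputs[i].to(at::kDouble),
        expected_outputs[i].to(at::kDouble),
        kDefaultRtol,
        atols[i]))
        << "Output " << i << " mismatches beyond the given tolerance.";
  }
}
```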
It was supposed to have been taken from PyTorch: https://github.com/pytorch/pytorch/blob/042f2f7746a064f1527d95d1f1d712b4f0b34186/test/test_transformers.py#L85
But it should be .0126.
Done -- added default rtols.
auto tanh_inner_sq = mul(tanh_inner, tanh_inner);
auto tanh_derivative = sub(IrBuilder::create<Val>(1.0), tanh_inner_sq);
auto left_derivative = mul(half, right);
Why is this rewrite necessary? You mentioned it didn't affect accuracy as you had hypothesized, so I'm wondering what this brings us.
It's not necessary and I'm happy to revert it. I kept it for only two reasons:
- Code becomes shorter.
- Code matches the PyTorch implementation more closely, which I hypothesized would help accuracy (see the formula sketched below).
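For reference, assuming the snippet implements the backward of the tanh-approximate GELU (as in PyTorch's formulation), the math it appears to encode is the product rule applied to 0.5*x and (1 + tanh u):

$$
\operatorname{gelu}(x) = \tfrac{1}{2}\,x\,\bigl(1 + \tanh u\bigr),
\qquad u = \sqrt{\tfrac{2}{\pi}}\,\bigl(x + 0.044715\,x^{3}\bigr)
$$

$$
\frac{d}{dx}\operatorname{gelu}(x)
= \tfrac{1}{2}\,\bigl(1 + \tanh u\bigr)
+ \tfrac{1}{2}\,x\,\bigl(1 - \tanh^{2} u\bigr)\,
  \sqrt{\tfrac{2}{\pi}}\,\bigl(1 + 3\cdot 0.044715\,x^{2}\bigr)
$$

The (1 - tanh^2 u) factor corresponds to the tanh_derivative term visible in the snippet above.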
!build
validate to take tolerances.