MarkAliasesPrepare attempts to put segment_set after intermediate tensors. #2639

wujingyue · 2024-07-19T00:03:44Z

Summary

This PR improves segmentation around meta ops for #2599. It mimics Thunder's bookend optimization, though only the output end is implemented at the moment. Even though the implementation is incomplete, I believe it's an improvement overall: it segments out meta ops more aggressively than before, which makes non-meta ops easier to schedule.
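
As a rough illustration of what "bookending from the output end" means, here is a minimal sketch on a linear chain of ops. It is not the pass's actual algorithm: `Op`, `is_meta`, and `outputBookendStart` are hypothetical names invented for this example, and a real fusion is a DAG rather than a chain.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a fusion op; not an nvFuser type.
struct Op {
  std::string name;
  bool is_meta;  // true for ops such as reshape/permute/slice that only change metadata
};

// Returns the index of the first op in the trailing run of meta ops.
// Ops in [result, chain.size()) form the "output bookend" that could be
// segmented out; ops before it stay in the kernel to be scheduled.
std::size_t outputBookendStart(const std::vector<Op>& chain) {
  std::size_t start = chain.size();
  while (start > 0 && chain[start - 1].is_meta) {
    --start;
  }
  return start;
}
```

For a chain `exp, reshape, add, permute, slice`, only `permute` and `slice` belong to the output bookend; the `reshape` in the middle does not, because a non-meta op follows it.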

Potential problems

segment_set forces the segmenter to give up scheduling the complete fusion and instead try to merge from singletons. In theory, merging should be able to reach a local maximum. In practice, it's not always as powerful as expected, e.g., https://github.com/NVIDIA/Fuser/pull/2639/files#diff-cf6d314e82c49485e9039e700a04a678904e5c669abcd1610ade897ca8f57c78R1472.
MarkAliasesPreparePass may set a suboptimal layout, because it runs before AllocationDomainPass and reasons about only meta ops. For example,
```
// `in` is a column-major input matrix.
x = PointwiseUnary(in);
out = Transpose(x);
```
MarkAliasesPreparePass will make x row-major, the default layout, and out column-major. Although Transpose becomes a no-op, PointwiseUnary becomes a pointwise kernel with transposition rather than a streamlined pointwise kernel. I think this can be fixed by combining the allocation-domain part of MarkAliasesPreparePass into AllocationDomainPass and of course running AllocationDomainPass beforehand.
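
The layout argument can be checked with plain shape/stride arithmetic. The sketch below is independent of nvFuser; `Layout2D` and the helper functions are hypothetical names made up for this illustration, and it only models dense 2D tensors.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical description of a dense 2D tensor: logical shape plus strides.
struct Layout2D {
  std::array<int64_t, 2> shape;
  std::array<int64_t, 2> stride;
};

// Row-major: the last dimension is contiguous.
Layout2D rowMajor(int64_t rows, int64_t cols) {
  return {{rows, cols}, {cols, 1}};
}

// Column-major: the first dimension is contiguous.
Layout2D columnMajor(int64_t rows, int64_t cols) {
  return {{rows, cols}, {1, rows}};
}

// A transpose expressed as a pure view: swap shape and stride entries,
// leaving the underlying memory untouched.
Layout2D transposeView(const Layout2D& l) {
  return {{l.shape[1], l.shape[0]}, {l.stride[1], l.stride[0]}};
}

bool sameLayout(const Layout2D& a, const Layout2D& b) {
  return a.shape == b.shape && a.stride == b.stride;
}
```

With `in = columnMajor(M, N)` and `x = rowMajor(M, N)`, `transposeView(x)` equals `columnMajor(N, M)`, so Transpose is a no-op, but `in` and `x` have different strides, so PointwiseUnary has to transpose. If `x` were instead column-major like `in`, `transposeView(x)` would equal `rowMajor(N, M)` and both ops would stay cheap.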

Performance comparison with TOT

https://gist.github.com/wujingyue/afdc3e89e693ff6bbc18271c96409b94 compares performance before and after the PR. It was collected by running the following commands.

$ python tools/benchmark_thunder.py --storage ~/workspace --filter='test_litgpt_qkv_split_rope[phi-2-backward-bs1-thunder] or test_litgpt_gelu[phi-2-backward-bs1-thunder] or test_batch_norm[backward-thunder]' main:main main:bug2599
$ pytest-benchmark --storage=$HOME/workspace compare 0011 0012 --group-by=name

Most benchmarks are neutral, which is expected because bookend is still on. Several benchmarks (namely, test_litgpt_qkv_split_rope[phi-2-backward-bs1-thunder], test_litgpt_gelu[phi-2-backward-bs1-thunder] and test_batch_norm[backward-thunder]) showed some regression. However, the regression disappeared the second time I ran the benchmarks, so it's likely noise.

wujingyue · 2024-07-19T23:21:47Z

!build

jjsjann123 · 2024-07-19T23:44:56Z

it's conflicting with @liqiangxl 's earlier PR 😛

Do you mind a rebase 🙇

wujingyue · 2024-07-21T03:30:40Z

!build

wujingyue · 2024-07-21T03:32:55Z

> Do you mind a rebase 🙇

Done! cc @jjsjann123

wujingyue · 2024-07-21T17:14:16Z

!build

wujingyue · 2024-07-22T06:52:45Z

!build

wujingyue · 2024-07-25T06:37:25Z

tests/cpp/test_gpu_view.cpp

@@ -2379,6 +2398,63 @@ TEST_F(GpuViewTest, SplitMergePointwiseSplitMerge) {
  testValidate(executor_cache.fusion(), {cg_outputs}, {t0}, __LINE__, __FILE__);
 }

+// segmented into 2 kernels: pointwise and reduction
+TEST_F(GpuViewTest, GroupNormOriginal) {


Copied from #2405.

jjsjann123

LGTM.

Thanks for modifying existing tests to work with the new analysis. That looked painful to go over each.

jjsjann123 · 2024-07-25T21:32:03Z

csrc/scheduler/mark_aliases.cpp

-    if (aliased_io == nullptr) {
-      continue;
+    if (TensorView* aliased_io = analysis.getRoot(out)) {
+      if (aliased_io->isFusionInput() || aliased_io->isFusionOutput()) {


Note to self: I think we need to figure out where we could use pad, slice, or segment_set as a hint for a potential kernel I/O tensor.

jjsjann123 · 2024-07-26T10:10:02Z

tests/cpp/test_alias.cpp

+  EXPECT_TRUE(out_tensors[2].is_alias_of(out_tensors[1]));
+}
+
+TEST_F(AliasTest, Bookend_Issue2375) {


cc'ing @liqiangxl

liqiangxl

LGTM. Thanks for adding this powerful feature!

liqiangxl · 2024-07-26T14:05:58Z

csrc/preseg_passes/mark_aliases_prepare.cpp

+  // inserted.
+  //
+  // Group `uses_to_segment` by `use_of` and remove duplicates.
+  std::sort(uses_to_segment.begin(), uses_to_segment.end());


We usually use stable_sort in nvFuser for reproducibility.

It makes no difference because the pairs <use_of,user> in uses_to_segment are distinct after the call to std::unique.

But I get your concern about non-determinism, whose root cause is sorting by pointer values. What sort keys would you use? Names?

I think name() should work.
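
A minimal sketch of what sorting by name instead of by pointer value could look like. `Node`, `Use`, and `sortAndDedup` are hypothetical stand-ins written for this illustration, not the PR's code; the sketch assumes every node has a unique integer name, so that two uses compare equal by name exactly when they are the same pair of pointers.

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical stand-in for an IR node with a unique, run-to-run stable name.
struct Node {
  int name;
};

// A (use_of, user) pair, mirroring the pairs discussed in this thread.
struct Use {
  const Node* use_of;
  const Node* user;
};

// Order by names rather than addresses so the result does not depend on
// where the allocator happened to place each node.
bool lessByName(const Use& a, const Use& b) {
  return std::make_tuple(a.use_of->name, a.user->name) <
      std::make_tuple(b.use_of->name, b.user->name);
}

bool sameUse(const Use& a, const Use& b) {
  return a.use_of == b.use_of && a.user == b.user;
}

// Groups uses by `use_of` and removes duplicates, deterministically.
void sortAndDedup(std::vector<Use>& uses) {
  std::sort(uses.begin(), uses.end(), lessByName);
  uses.erase(std::unique(uses.begin(), uses.end(), sameUse), uses.end());
}
```

Because equal keys imply identical elements under the uniqueness assumption, `std::sort` and `std::stable_sort` give the same result here; the determinism comes from the key, not from the stability of the sort.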

tests/cpp/test_gather.cpp

liqiangxl · 2024-07-26T14:46:24Z

tests/cpp/test_resize.cpp

@@ -2000,7 +2001,8 @@ TEST_F(ResizeTest, ResizeReshapeAndSlice) {
      tv1,
      {{IrBuilder::create<Val>(0L), IrBuilder::create<Val>(2L)},
       {IrBuilder::create<Val>(0L), IrBuilder::create<Val>(2L)}});
-  fusion->addOutput(tv2);
+  auto tv3 = add(tv2, tv2);


can we add this comment to the code?

liqiangxl · 2024-07-26T14:53:51Z

(1) We may need to check the impact on the segmenter's CPU overhead, since segment_set forces the segmenter to give up scheduling the complete fusion and instead try to merge from singletons.
(2) When this is merged into the main branch, we need to check the code diff to avoid over-segmentation similar to what happened in several tests modified in this PR.

wujingyue · 2024-07-26T18:52:35Z

!build --diff

wujingyue · 2024-07-27T05:27:54Z

> When this is merged into the main branch, we need to check the code diff to avoid over-segmentation similar to what happened in several tests modified in this PR.

I don't understand the result of codegen_diff any more. E.g., https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/103497030/viewer failed, but I couldn't find which test case failed or what the diff was.

wujingyue · 2024-07-27T05:48:53Z

!build --diff

wujingyue marked this pull request as draft July 19, 2024 00:03

wujingyue force-pushed the bug2599 branch 5 times, most recently from 869e250 to 0fa08fe Compare July 19, 2024 23:21

jjsjann123 mentioned this pull request Jul 20, 2024

Allow alias analysis mark candidate on segmented fusion inputs #2608

Closed

wujingyue added a commit that referenced this pull request Jul 21, 2024

Add a test to verify a segment can hold multiple segment_sets.

ed5690b

This was a question I asked myself when working on #2639. Knowing this allows me to simplify some segment-set-inserting logic there.

wujingyue added a commit that referenced this pull request Jul 21, 2024

Add a test to verify a segment can hold multiple segment_sets.

6d070ea

This was a question I asked myself when working on #2639. Knowing this allows me to simplify some segment-set-inserting logic there.

wujingyue mentioned this pull request Jul 21, 2024

Add a test to verify a segment can hold multiple segment_sets. #2653

Merged

wujingyue force-pushed the bug2599 branch 2 times, most recently from 3aad384 to 39a9c43 Compare July 21, 2024 03:24

wujingyue changed the base branch from main to wjy/revert July 21, 2024 03:24

wujingyue force-pushed the bug2599 branch from 39a9c43 to a456b03 Compare July 21, 2024 03:30

wujingyue force-pushed the bug2599 branch from a456b03 to 97e665b Compare July 21, 2024 07:19

wujingyue mentioned this pull request Jul 25, 2024

The no-op scheduler clears memory space. #2679

Merged

wujingyue added a commit that referenced this pull request Jul 25, 2024

Run preseg before schedule.

35a4c68

This fixes a problem that's exposed by #2639, and is AFAIK the right order of actions. We almost never run preseg passes after scheduling, except for multi-GPU.

wujingyue mentioned this pull request Jul 25, 2024

Run preseg before schedule. #2680

Merged

wujingyue force-pushed the bug2599 branch from 26c6c96 to 233d4e9 Compare July 25, 2024 06:15

wujingyue changed the title ~~WIP~~ MarkAliasesPrepare attempts to put segment_set after intermediate tensors. Jul 25, 2024

wujingyue requested review from liqiangxl and jjsjann123 July 25, 2024 06:35

wujingyue marked this pull request as ready for review July 25, 2024 06:35

jjsjann123 approved these changes Jul 26, 2024

liqiangxl approved these changes Jul 26, 2024

wujingyue force-pushed the wjy/revert branch from 3b58749 to 03769c7 Compare July 26, 2024 18:50

wujingyue force-pushed the bug2599 branch from d6e8535 to cc897e6 Compare July 26, 2024 18:51

wujingyue changed the base branch from wjy/revert to main July 27, 2024 05:47

wujingyue added 14 commits July 27, 2024 05:48

Revert "Allow output to alias intermediate tensor, step-2 to solve gr…

7f15a47

…oup norm segmentation issue #2375 (#2405)" This reverts commit 15bdf9f.

Add unit tests.

8d905b4

Tests are passing.

419323e

Add back the unit tests from #2405.

441546e

Change several tests to accomodate the stronger MarkAliasesPrepare.

b564ddf

Fix for SegmentationTest.InputForwardingUntilBinary.

2ff7bdf

Change tests to work around a segmenter limitation.

80bb70b

Remove a redundant set.

5ed6971

Remove AliasAnalysisResult::source.

bca63ca

s/getNearestAliasedIo/getRoot/g

bf26e8a

s/pair/Use

097fe51

Minor

37ad0a4

Move isSegmentSet to ir_utils.

20f5b73

Review comments.

ec79fc5

wujingyue force-pushed the bug2599 branch from cc897e6 to ec79fc5 Compare July 27, 2024 05:48

wujingyue merged commit fe34321 into main Jul 27, 2024
32 of 36 checks passed

wujingyue deleted the bug2599 branch July 27, 2024 14:05

This was referenced Jul 30, 2024

Backward Tensor parallel MLP #2676

Merged

MarkAliasPrepare does not preserve shardings #2721

Closed

wujingyue mentioned this pull request Aug 20, 2024

MarkAliasesPrepare applies bookend from inputs as well as outputs. #2815

Closed

wujingyue added the enhancement New feature or request label Sep 7, 2024
