Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MarkAliasesPrepare attempts to put segment_set after intermediate tensors. #2639

Merged
merged 14 commits into from
Jul 27, 2024

Conversation

wujingyue
Copy link
Collaborator

@wujingyue wujingyue commented Jul 19, 2024

Summary

This PR improves segmentation around meta ops for #2599. It mimics Thunder's bookend optimization, although only the output end is implemented at this moment. Although the implementation isn't complete, I believe it's an improvement overall. It segments out meta ops more aggressively than before, making non-meta ops easier to schedule.

Potential problems

  1. segment_set enforces the segmenter to give up scheduling the complete fusion and instead trying to merge from singletons. In theory, merging should be able to reach a "local maxima". But in practice it's not always as powerful as expected, e.g., https://github.com/NVIDIA/Fuser/pull/2639/files#diff-cf6d314e82c49485e9039e700a04a678904e5c669abcd1610ade897ca8f57c78R1472.
  2. MarkAliasesPreparePass may set an suboptimal layout, because it runs before AllocationDomainPass and reasons about only meta ops. For example,
    // `in` has a column-major input matrix. 
    x = PointwiseUnary(in);
    out = Transpose(x);
    
    MarkAliasesPreparePass will make x row-major, the default layout, and out column-major. Although Transpose is made no-op, PointwiseUnary becomes a pointwise kernel with transposition rather than a streamline pointwise kernel. I think this can be fixed by combining the allocation-domain part of MarkAliasesPreparePass into AllocationDomainPass and of course running AllocationDomainPass before.

Performance comparison with TOT

https://gist.github.com/wujingyue/afdc3e89e693ff6bbc18271c96409b94 is the performance comparison before the PR and after. This is collected by running the following commands.

$ python tools/benchmark_thunder.py --storage ~/workspace --filter='test_litgpt_qkv_split_rope[phi-2-backward-bs1-thunder] or test_litgpt_gelu[phi-2-backward-bs1-thunder] or test_batch_norm[backward-thunder]' main:main main:bug2599
$ pytest-benchmark --storage=$HOME/workspace compare 0011 0012 --group-by=name

Most benchmarks are neutral, which is as expected because bookend is still on. Several benchmarks (namely, test_litgpt_qkv_split_rope[phi-2-backward-bs1-thunder], test_litgpt_gelu[phi-2-backward-bs1-thunder] and test_batch_norm[backward-thunder]) showed some regression. However, the regression disappeared the second time I ran the benchmarks, so it's likely a noise.

@wujingyue wujingyue marked this pull request as draft July 19, 2024 00:03
@wujingyue wujingyue force-pushed the bug2599 branch 5 times, most recently from 869e250 to 0fa08fe Compare July 19, 2024 23:21
@wujingyue
Copy link
Collaborator Author

!build

@jjsjann123
Copy link
Collaborator

it's conflicting with @liqiangxl 's earlier PR 😛

Do you mind a rebase 🙇

wujingyue added a commit that referenced this pull request Jul 21, 2024
This was a question I asked myself when working on #2639. Knowing this
allows me to simplify some segment-set-inserting logic there.
wujingyue added a commit that referenced this pull request Jul 21, 2024
This was a question I asked myself when working on #2639. Knowing this
allows me to simplify some segment-set-inserting logic there.
@wujingyue wujingyue force-pushed the bug2599 branch 2 times, most recently from 3aad384 to 39a9c43 Compare July 21, 2024 03:24
@wujingyue wujingyue changed the base branch from main to wjy/revert July 21, 2024 03:24
@wujingyue
Copy link
Collaborator Author

!build

@wujingyue
Copy link
Collaborator Author

Do you mind a rebase 🙇

Done! cc @jjsjann123

@wujingyue
Copy link
Collaborator Author

!build

1 similar comment
@wujingyue
Copy link
Collaborator Author

!build

wujingyue added a commit that referenced this pull request Jul 25, 2024
This fixes a problem that's exposed by #2639, and is AFAIK the right
order of actions. We almost never run preseg passes after scheduling,
except for multi-GPU.
@wujingyue wujingyue changed the title WIP MarkAliasesPrepare attempts to put segment_set after intermediate tensors. Jul 25, 2024
@wujingyue wujingyue marked this pull request as ready for review July 25, 2024 06:35
@@ -2379,6 +2398,63 @@ TEST_F(GpuViewTest, SplitMergePointwiseSplitMerge) {
testValidate(executor_cache.fusion(), {cg_outputs}, {t0}, __LINE__, __FILE__);
}

// segmented into 2 kernels: pointwise and reduction
TEST_F(GpuViewTest, GroupNormOriginal) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copied from #2405.

Copy link
Collaborator

@jjsjann123 jjsjann123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Thanks for modifying existing tests to work with the new analysis. That looked painful to go over each.

if (aliased_io == nullptr) {
continue;
if (TensorView* aliased_io = analysis.getRoot(out)) {
if (aliased_io->isFusionInput() || aliased_io->isFusionOutput()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for myself. I think we need to where we could use pad, slice, segmenter_set to as a hint a potential kernel IO tensor..

tests/cpp/test_alias.cpp Show resolved Hide resolved
EXPECT_TRUE(out_tensors[2].is_alias_of(out_tensors[1]));
}

TEST_F(AliasTest, Bookend_Issue2375) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc'ing @liqiangxl

Copy link
Collaborator

@liqiangxl liqiangxl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for adding this powerful feature!

csrc/preseg_passes/mark_aliases_prepare.cpp Outdated Show resolved Hide resolved
// inserted.
//
// Group `uses_to_segment` by `use_of` and remove duplicates.
std::sort(uses_to_segment.begin(), uses_to_segment.end());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually use stable_sort in nvFuser for reproductivity.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes no difference because the pairs <use_of,user> in uses_to_segment are distinct after the call to std::unique.

But I got your concern on non-determinism, of which the root cause is sorting by pointer values. What sort keys would you use? Names?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think name() should work.

csrc/preseg_passes/mark_aliases_prepare.cpp Outdated Show resolved Hide resolved
csrc/preseg_passes/mark_aliases_prepare.cpp Outdated Show resolved Hide resolved
tests/cpp/test_gather.cpp Outdated Show resolved Hide resolved
@@ -2000,7 +2001,8 @@ TEST_F(ResizeTest, ResizeReshapeAndSlice) {
tv1,
{{IrBuilder::create<Val>(0L), IrBuilder::create<Val>(2L)},
{IrBuilder::create<Val>(0L), IrBuilder::create<Val>(2L)}});
fusion->addOutput(tv2);
auto tv3 = add(tv2, tv2);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add this comment to the code?

@liqiangxl
Copy link
Collaborator

(1) We may need to check the influence on the CPU overhead of segmenter since segment_set enforces the segmenter to give up scheduling the complete fusion and instead trying to merge from singletons.
(2) When merged into main branch, needs to check code diff to avoid over segmentation similar to several tests modified in this PR.

@wujingyue
Copy link
Collaborator Author

!build --diff

@wujingyue
Copy link
Collaborator Author

When merged into main branch, needs to check code diff to avoid over segmentation similar to several tests modified in this PR.

I don't understand the result of codegen_diff any more. E.g., https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/103497030/viewer failed but I couldn't find which test case failed and what was the diff.

@wujingyue wujingyue changed the base branch from wjy/revert to main July 27, 2024 05:47
@wujingyue
Copy link
Collaborator Author

!build --diff

@wujingyue wujingyue merged commit fe34321 into main Jul 27, 2024
32 of 36 checks passed
@wujingyue wujingyue deleted the bug2599 branch July 27, 2024 14:05
@wujingyue wujingyue added the enhancement New feature or request label Sep 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants