MarkAliasesPrepare attempts to put segment_set after intermediate tensors. #2639
Conversation
!build
It's conflicting with @liqiangxl's earlier PR 😛 Do you mind a rebase? 🙇
This was a question I asked myself when working on #2639. Knowing this allows me to simplify some segment-set-inserting logic there.
!build
Done! cc @jjsjann123
!build
!build
This fixes a problem that's exposed by #2639, and is AFAIK the right order of actions. We almost never run preseg passes after scheduling, except for multi-GPU.
tests/cpp/test_gpu_view.cpp
Outdated
@@ -2379,6 +2398,63 @@ TEST_F(GpuViewTest, SplitMergePointwiseSplitMerge) {
  testValidate(executor_cache.fusion(), {cg_outputs}, {t0}, __LINE__, __FILE__);
}

// segmented into 2 kernels: pointwise and reduction
TEST_F(GpuViewTest, GroupNormOriginal) {
Copied from #2405.
LGTM.
Thanks for modifying the existing tests to work with the new analysis. Going over each one must have been painful.
if (aliased_io == nullptr) {
  continue;
if (TensorView* aliased_io = analysis.getRoot(out)) {
  if (aliased_io->isFusionInput() || aliased_io->isFusionOutput()) {
Note for myself: I think we need to consider where we could use pad, slice, and segmenter_set as hints of a potential kernel I/O tensor.
  EXPECT_TRUE(out_tensors[2].is_alias_of(out_tensors[1]));
}

TEST_F(AliasTest, Bookend_Issue2375) {
cc'ing @liqiangxl
LGTM. Thanks for adding this powerful feature!
// inserted.
//
// Group `uses_to_segment` by `use_of` and remove duplicates.
std::sort(uses_to_segment.begin(), uses_to_segment.end());
We usually use std::stable_sort in nvFuser for reproducibility.
It makes no difference here because the pairs <use_of, user> in uses_to_segment are distinct after the call to std::unique.
But I take your point about non-determinism, whose root cause is sorting by pointer values. What sort keys would you use? Names?
I think name() should work.
@@ -2000,7 +2001,8 @@ TEST_F(ResizeTest, ResizeReshapeAndSlice) {
    tv1,
    {{IrBuilder::create<Val>(0L), IrBuilder::create<Val>(2L)},
     {IrBuilder::create<Val>(0L), IrBuilder::create<Val>(2L)}});
fusion->addOutput(tv2);
auto tv3 = add(tv2, tv2);
can we add this comment to the code?
(1) We may need to check the influence on the CPU overhead of the segmenter since
!build --diff
I don't understand the result of codegen_diff anymore. E.g., https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/103497030/viewer failed, but I couldn't find which test case failed or what the diff was.
!build --diff
Summary
This PR improves segmentation around meta ops for #2599. It mimics Thunder's bookend optimization, although only the output end is implemented at this moment. Although the implementation isn't complete, I believe it's an improvement overall. It segments out meta ops more aggressively than before, making non-meta ops easier to schedule.
Potential problems
segment_set forces the segmenter to give up scheduling the complete fusion and instead try to merge from singletons. In theory, merging should be able to reach a local maximum, but in practice it's not always as powerful as expected, e.g., https://github.com/NVIDIA/Fuser/pull/2639/files#diff-cf6d314e82c49485e9039e700a04a678904e5c669abcd1610ade897ca8f57c78R1472.
AllocationDomainPass reasons about only meta ops. For example, suppose x is row-major, the default layout, and out is column-major. Although Transpose is made a no-op, PointwiseUnary becomes a pointwise kernel with transposition rather than a streamlined pointwise kernel. I think this can be fixed by combining the allocation-domain part of MarkAliasesPreparePass into AllocationDomainPass and, of course, running AllocationDomainPass beforehand.
Performance comparison with TOT
https://gist.github.com/wujingyue/afdc3e89e693ff6bbc18271c96409b94 is the performance comparison before and after this PR. It was collected by running the following commands.
Most benchmarks are neutral, which is as expected because bookend is still on. Several benchmarks (namely, test_litgpt_qkv_split_rope[phi-2-backward-bs1-thunder], test_litgpt_gelu[phi-2-backward-bs1-thunder], and test_batch_norm[backward-thunder]) showed some regression. However, the regression disappeared the second time I ran the benchmarks, so it's likely noise.