
pointwise scheduler fails to validate reference tv #3513

Open · wants to merge 72 commits into base: main

Conversation

@jjsjann123 (Collaborator) commented Dec 2, 2024

Fixes: #3512

When picking the reference tv, the pointwise scheduler fails to validate that the transformations on the reference tv can be safely propagated to all outputs in the fusion. The issue occurs when an IterDomain that's not in the reference tv is merged with another dimension in an output tv, preventing the merge on the reference tv from being propagated to that target.

This PR adds an optional check, areAllOutputIdsMappedTo, in nvfuser::pointwise_utils::DomainMap::isValidReference.

The added check verifies that all source producer IterDomains producing the IterDomains on outputs are covered by the reference tv. This is safe for the pointwise scheduler, since the scheduler checks that there's no reversible view present in the fusion.

The check is optional and is disabled by the transpose scheduler, where the reference_tv is not supposed to cover the entire fusion, but rather a subset of fusion IO tensors. We should extend that in future PRs.
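
To make the intent concrete, here is a toy, self-contained sketch of the coverage idea; this is my own illustration with stand-in types, not nvFuser's actual implementation:

#include <algorithm>
#include <set>
#include <vector>

// Toy stand-in: represent each tensor by the set of source IterDomains
// (here just ints) that produce its logical domain.
using IdSet = std::set<int>;

// A reference is valid only if, for every output, all of the output's
// source ids are also among the reference's source ids.
bool isValidReference(const IdSet& ref_sources,
                      const std::vector<IdSet>& output_sources) {
  for (const IdSet& out : output_sources) {
    // std::set iterates in sorted order, so std::includes applies.
    if (!std::includes(ref_sources.begin(), ref_sources.end(),
                       out.begin(), out.end())) {
      return false;  // an output depends on an id the reference lacks
    }
  }
  return true;
}

int main() {
  IdSet ref = {0};                          // reference covers i0 only
  std::vector<IdSet> outs = {{0}, {0, 1}};  // second output also uses i1
  // Invalid: i1 is not covered, so a merge involving i1 on that output
  // could not be replayed from the reference.
  return isValidReference(ref, outs) ? 0 : 1;
}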

@jjsjann123 (Collaborator Author)

!test

@jjsjann123 jjsjann123 changed the title Pw scheduler reference find patch [SMOKE TEST] Pw scheduler reference find patch Dec 2, 2024
@jjsjann123 (Collaborator Author)

!test

@jjsjann123 (Collaborator Author)

🤞

@jjsjann123 jjsjann123 changed the title [SMOKE TEST] Pw scheduler reference find patch pointwise scheduler fails to validate reference tv Dec 5, 2024
@jjsjann123 (Collaborator Author)

!test

@jjsjann123 jjsjann123 requested a review from naoyam December 5, 2024 10:37
@jjsjann123 (Collaborator Author)

Accidentally hit this old issue again when I was playing with slice. #2514 (comment)
I don't think that one should count towards reference tv selection.

@jjsjann123 (Collaborator Author)

!test --diff-bench

@naoyam (Collaborator) commented Dec 12, 2024

> But now I'm worried that this check might not work for pad, where we go from a broadcast ID to a non-broadcast ID... I'm checking that with a test.

Ah, this actually can cause an issue. For example, suppose we pick a tensor as a reference that has a broadcast ID, and that broadcast ID comes from a fusion input tensor. Suppose that broadcast ID is also used by a pad op, generating a non-broadcast ID, and that non-broadcast ID is NOT included in the reference.

More specifically:

t0 = [i0, b1] // fusion input
t1 = [i2, i3, i4, b5] // fusion input

// t2: [b6, b7, i0, b1]
t2 = broadcast(t0, [true, true, false, false]) 

t3 = add(t1, t2) // fusion output


// t4: [i0, i8]
t4 = pad(t0) // fusion output

Here, suppose we choose t3 as a reference. Clearly, we can't schedule the i8 domain of t4. However, since the propagation back from i8 reaches b1, which is also reachable from t3, the reference validation would miss flagging the invalid reference.
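
For concreteness, this is roughly what that fusion looks like as an nvFuser C++ test. A sketch only: I'm assuming the usual test helpers (makeConcreteTensor, broadcast, add, pad), that extent-1 dimensions come out as broadcast IDs, and that the two pad widths apply to the innermost dimension, so b1 is the one padded.

auto fusion_ptr = std::make_unique<Fusion>();
auto fusion = fusion_ptr.get();
FusionGuard fg(fusion);

// t0 = [i0, b1]
auto tv0 = makeConcreteTensor({-1, 1});
fusion->addInput(tv0);
// t1 = [i2, i3, i4, b5]
auto tv1 = makeConcreteTensor({-1, -1, -1, 1});
fusion->addInput(tv1);

// t2 = [b6, b7, i0, b1]
auto tv2 = broadcast(tv0, {true, true, false, false});
// t3: fusion output
auto tv3 = add(tv1, tv2);
fusion->addOutput(tv3);

// t4 = [i0, i8]: pad widths are inner-to-outer, so this pads b1 into
// the non-broadcast i8
auto tv4 = pad(tv0, {fusion->oneVal(), fusion->oneVal()});
fusion->addOutput(tv4);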

@jjsjann123 (Collaborator Author) commented Dec 12, 2024

> Clearly, we can't schedule the i8 domain of t4.

I was worried about the same thing, and I was thinking about changing the get_source_iter_domains here so that outputs of resize are also treated as sources.

But it turns out we don't need that: transform propagation can actually schedule a fusion like that from t3.

You can look at the other example I added for pad (there's a typo in the comment on tv0, I'll fix that); it's very similar to yours. I also just tried your example and the scheduling seems to work. I'll add it to the test as well.

I think the difference between this example and the original issue is due to resize, which links the input ID to the output ID, so transform propagation works (I think?!). I was planning to ask you that question tomorrow, since it was getting late when I ran into this.
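
My mental model of the ID graph here (unverified): pad introduces a Resize op, so we roughly get

b1 (in t0) --Resize--> i8 (in t4)

i.e. i8 is not an unconnected source ID; it hangs off b1, which is reachable from the reference, so transform propagation has an edge to traverse.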

@jjsjann123 (Collaborator Author)

!test --diff-bench

auto fusion_ptr = std::make_unique<Fusion>();
auto fusion = fusion_ptr.get();
FusionGuard fg(fusion);

@jjsjann123 (Collaborator Author)

If I'm not mistaken, this is the example you brought up in your comment, @naoyam

@jjsjann123 (Collaborator Author)

wait... the definition isn't right....

@naoyam (Collaborator)

Hmm, something looks off to me. For example, padding seems to be done with i0 as its input, so this doesn't seem like a good example of padding a broadcast ID.

@naoyam (Collaborator)

Note that the order of the padding widths is from inner to outer, just like the PyTorch pad.

@jjsjann123 (Collaborator Author)

Sorry, I got confused. I think your example looks like mine, where the broadcast IDs map between the two operators.

The example I had here is actually slightly different and it doesn't map. Let me play with this one a bit more.

@jjsjann123 (Collaborator Author)

Ah, so after I fixed the example, it turns out I'm causing a regression: the added example can't be scheduled as a single fusion, while at ToT the example compiles and runs as a single fusion.

I'm trying a small refactor to avoid that.

Playing with slice/pad made me slightly more confident in our transform propagation. 😄

// -> concrete map to i1
// So T3 is contained by T2. See test `PointwiseTest.DomainMapPad1`.
// Using PERMISSIVE mode here so that broadcast IDs map to the concrete
// IDs they are resolved with.
auto concrete_source_id_out =
    ca_map_.getConcreteMappedID(source_id_out, IdMappingMode::PERMISSIVE);
@jjsjann123 (Collaborator Author)

This is the change I made in order to avoid the regression in the added test, PointwiseTest.DomainMapPad1.

@jjsjann123 (Collaborator Author)

Looks like this did work with our tests.

@naoyam Let me know what you think about this change.

@jjsjann123 (Collaborator Author)

!test --diff-bench

@jjsjann123 (Collaborator Author)

errr... what's with CI 😭

@jjsjann123 (Collaborator Author)

!test --diff-bench

@jjsjann123 jjsjann123 requested a review from naoyam December 12, 2024 17:46
@jjsjann123 (Collaborator Author) commented Dec 12, 2024

My gut feeling, based on my naive understanding of transform propagation: if we actually had an expandOp, or even just used a resize to link the IDs that are expanded, I feel nvFuser might be able to codegen the original issue #3512 without a problem.

The other scenario is also pretty interesting to me. #3576

NOTE for myself: I think I should go have a look at how transform propagation actually works to verify this, for peace of mind.

Successfully merging this pull request may close these issues.

pointwise scheduler picks invalid reference TensorView