(Yet another) indexing war for resize #3454

naoyam · 2024-11-20T17:47:18Z

This is a WAR for #3455. The exact graph-based indexing doesn't work because of the mapping introduced by the residual path. I think we should investigate what the right graph should look like for indexing, but to unblock the scheduler for RoPE, this PR tries to work around the issue by creating a local graph that only includes the tensors involved in the expression to index, thus removing the effect by the residual path.

IndexngTraversal::getExprsBetweenForResize is the main addition, which creates a new IdModel just consisting of the tensors of a given expr. If a resize is used in any of the producers and consumers of the expr, we use the path found by the local model. Currently, it it fails to find a path, it's considered an error.

While this WAR works for the prototype scheduler for RoPE so far (#3425), it does have some issues as well. For example, since the local IdModel doesn't have all the information necessary to identify loop promotions, but the loop domain of the expr may be promoted, so it may not be able to find the corresponding IDs within the local model. In other words, if resize is used with inlined broadcast IDs, getExprsBetweenForResize may fail to find a path, which would then fall back to the existing path, which may not be correct in the case of #3455. However, this can be avoided by scheduling the loop domains such that no promotion analysis is required. We can now do this by using things like TensorDomain::broadcast() and scheduler_tools::scheduleLoopDomainsLike(), so I don't think this issue is a blocker.

The overall changes are also due to the change of the interface of IndexingTraversal::getExprsBetween, which now requires std::vector<IterDomain*> instead of ValGroups since for the local IdModel, the former is required.

naoyam · 2024-11-20T17:54:48Z

!test

naoyam · 2024-11-21T00:38:43Z

!test --diff

naoyam · 2024-11-21T02:49:04Z

!test

naoyam · 2024-11-21T19:56:22Z

@jacobhinkle @zasdfgbnm While the H100 tests are still blocked, could you please start reviewing this PR? I'll also do the codegen diff tests once the H100 tests are completed.

zasdfgbnm · 2024-11-21T20:45:50Z

csrc/index_compute.cpp

+  // TensorIndexer needs IterDomain instead of ValGroup to work around
+  // the resize indexing issue
+  std::vector<IterDomain*> ids_to_index;
+  ids_to_index.reserve(groups_to_index.size());
+  const auto tma_all_ids = is_load ? consumer_tv->domain()->allIDs()
+                                   : producer_tv->domain()->allIDs();
+  for (const auto& group : groups_to_index) {
+    auto it = std::find_if(
+        tma_all_ids.begin(), tma_all_ids.end(), [&](IterDomain* gmem_id) {
+          return group->has(gmem_id);
+        });
+    NVF_ERROR(
+        it != tma_all_ids.end(),
+        "Cannot find corresponding ID for ",
+        nvfuser::toString(group));
+    ids_to_index.push_back(*it);
+  }


Does the unit tests of TMA all pass? I remembered that some ids may be newly created IDs, therefore not in the allIDs of the tv.

That's I wanted to confirm with the CI tests. I'll check them manually.

OK, these tests failed here:

[ FAILED ] TMAIndexingTest.DefineBoxByCompositing1 [ FAILED ] TMAIndexingTest.DefineBoxByCompositing2 [ FAILED ] TMAIndexingTest.NonTrivialGmemAllocationDomain2

I think we can just pick any ID from the ValGroup. The arguments to TensorIndexer doesn't need to be those of the indexed tensors. It is required for the resize WAR case (https://github.com/NVIDIA/Fuser/blob/resize_indexing_cyclic_graph_war/csrc/id_model/indexing_traversal.cpp#L81-L104), but as long as TMA is not used resize, it wouldn't fail.

This is obviously far from ideal since it would fail if both TMA and resize are used but at least that should be a hard failure rather than silently using invalid indexing paths.

As a WAR, I am OK with switching to use ID, and using an arbitrary ID sounds good to me. What I am more interested is, in the future, assuming that we have a real fix for the resize problem. Would we still use ID, or will we use ValGroup?

I think we should switch back to ValGroup, although Val may not be just IterDomain anymore, like you suggested, we may need to include Expr as a key as well.

Confirmed all tests pass on H100.

naoyam · 2024-11-22T16:32:25Z

!test --diff

jacobhinkle

For the example in #3455, I'm having a little trouble seeing how this will work. When we traverse, we first following the path from t0 to t1, t2, t3. That uses resize-based indexing, and sets the index map to reflect that. Then, when indexing t0 as producer for t4, we find a resize in the ExprPath but no Resize in t0 or t4, so getExprsBetweenForResize should now return nullopt and we instead do a ValGraph traversal which finds the shorter path and updates the index map. Is that right? If there were a further use of t3 that happened to be traversed after t4 in that example, then would it wind up using the correct (Resize-based) indexing or would it have switched to the more recent t4/t0 index?

jacobhinkle · 2024-11-22T16:54:16Z

csrc/id_model/id_model_index_compute.h

@@ -62,7 +62,10 @@ class IdGraphIndexCompute : public OptOutDispatch {
  }

  void setIndex(IterDomain* id, Val* idx) {
-    index_map_.emplace(toGroup(id), idx);
+    // May overwrite index. When the graph is cyclic due to, e.g.,
+    // resize, the index obtained by traversing most through the


"by traversing most recently"? So the index map will reflect whichever Expr was most recently processed?

Yes. It's a good question as this will have some related change in a follow-up PR 😃

naoyam · 2024-11-22T18:56:13Z

For the example in #3455, I'm having a little trouble seeing how this will work. When we traverse, we first following the path from t0 to t1, t2, t3. That uses resize-based indexing, and sets the index map to reflect that. Then, when indexing t0 as producer for t4, we find a resize in the ExprPath but no Resize in t0 or t4, so getExprsBetweenForResize should now return nullopt and we instead do a ValGraph traversal which finds the shorter path and updates the index map. Is that right?

I think your understanding is correct.

If there were a further use of t3 that happened to be traversed after t4 in that example, then would it wind up using the correct (Resize-based) indexing or would it have switched to the more recent t4/t0 index?

Are you thinking about a case like:

t1 = t0[:N/2]
t2 = t0[N/2:]
t3 = cat(t2, t1)
t4 = t3 + t0
t5 = t3 + 1

And your question is how to index t5? If so, since t3nort5hasResize`, it would be just done by the default indexing path.

Not appearing here, but there's a hidden tensor between t1 and t3 (and also between t2 and t3), which does have a Resize op. However, the local expr path doesn't include the Resize, it will still fall back to the default path.

Let me know if this clarifies your understanding.

jacobhinkle · 2024-11-22T19:11:26Z

Yes, that's the type of case I had in mind. I was thinking that since the ValGroup for t5's ID will be the same as that of x0, it will use the new index as you say and that that one would be incorrect. But it actually is correct, it's just that it's not the same way t0 is indexed in the Resize ops before the cat. OK, I get it now. Thanks for the explanation.

…c_graph_war

naoyam · 2024-11-23T03:59:20Z

!test --diff

Avoid mapping of, e.g., split 32 by 64 -> 1, 64

naoyam · 2024-11-27T06:40:50Z

!test --diff

This reverts commit 4bf1829.

naoyam · 2024-11-27T22:39:45Z

!test

naoyam · 2024-11-28T01:24:54Z

@zasdfgbnm @jacobhinkle

I found an indexing bug that showed up with some of the Hopper matmul tests. The last commit of this PR fixed the issue 2d6bc21.

What happened was that updating an existing index mapping resulted in making the index replacement fail to do its job. IdGraphIndexCompute::setIndex now updates an existing mapping. This change is necessary for resize indexing, but suppose we have:

Loop domain: {i0(8)}
split i0 by 8 -> i1(1), i2(8)
Initial loop index of i0: i0_loop_idx

Since i0 and i2 are mapped in the AlmostExact graph, the assigned index of {i0, i2} remained to be i0_loop_idx. However, with this PR, it gets updated to i0_loop_idx % 8. This itself should be fine, but TensorIndexer::getIndexReplacementMap creates a replacement mapping from i0_loop_idx %8, not from i0_loop_idx. So, before this PR, the replacement map looked like:

i0_loop_idx -> i0_actual_idx

Now it looks like:

i0_loop_idx % 8 -> i0_actual_idx

This replacement map is not what we need because indices just using i0_loop_idx won't be replaced. For example, the index of i1 is i0_loop_idx / 8. Previously, this was updated to i0_actual_idx / 8, but after this PR, no replacement is done, which is wrong.

To fix the issue, the replacement map is now constructed with the initial loop indices instead of the computed indices of the loop IDs. TensorIndexer::getLoopIndex gives the initial index of a given loop ID (it needs the actual for-loops to support circular buffering).

I manually confirmed the tests pass on H100 and all other tests seem passing, so I'm going to merge this PR soon, but let me know if you have any concern.

This is a follow-up to #3454. Specifically, as we now allow updating of index mappings in `TensorIndexer::setIndex`, there can be a case like below: ``` merge b0(1), i1(8) -> i2(8) ``` When propagating the index of `i2`, `i2_idx`, backward, the input IDs would get `i2_idx / 8` and `i2_idx % 8`, respectively. However, if `i2_idx` is not guaranteed to be less than 8 (for example, due to a non-divisible split of `i2`), the broadcast `b0` id would potentially get a non-zero index, which means that we would need to predicate `b0` as well, i.e., `i2_idx / 8 < 1`, if it's part of the allocation domain. However, this would not be predicated as we ignore broadcast IDs. The new unit test would fail at the validation due to this predication problem. To fix the issue, we could also predicate broadcast allocation IDs. Instead, this PR takes a simpler approach that just forwards a given index to its almost-exactly mapped ID as is. In the above case, `b0` and `i1` would get `0` and `i2_idx`, respectively. Tested H100 manually.

Fixes this error of #3505 ``` Error from segmentation group 9: INTERNAL ASSERT FAILED at "/Fuser/csrc/id_model/indexing_traversal.cpp":102, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Indexing path for resize not found: iblockIdx.y376{( ceilDiv(1280, blockDim.x) )} ``` The error happens when trying to use the indexing WAR for resize that was recently added (#3454). The war itself is limited, in particular it does not work with promoted loop IDs. The limitation should be fine for the RoPE scheduling I've been working on, but it's a real issue in general. This PR avoids the issue by limiting the use of the WAR. Currently, the WAR is used whenever there's at least a single resize expr in a single math expr. That is actually overly pessimistic since the indexing issue only happens when there's multiple resize exprs that result in a cycle in the exact graph. For example, if there's only one resize, there must be no cycle, thus the indexing WAR is not necessary. This PR attempts to limit the use of the WAR by doing a little deeper analysis. The added check should entirely disable the WAR for the current default scheduling, where resize is only allowed with fusion inputs, which means there can be no multiple dependent resize exprs in a single fusion. The limitation of the WAR remains, but it does not matter for RoPE, and with this PR it should also not matter for general cases.

naoyam added 2 commits November 20, 2024 08:58

Return a bool indicating if all nodes are visited

69a109c

cleanup

01d8e08

naoyam mentioned this pull request Nov 20, 2024

Invalid indexing path when resize is used with a residual path #3455

Open

fix

cb71432

naoyam added the rope label Nov 21, 2024

cleanup

6c1f5a4

naoyam marked this pull request as ready for review November 21, 2024 19:54

naoyam requested review from jacobhinkle and zasdfgbnm November 21, 2024 19:54

zasdfgbnm reviewed Nov 21, 2024

View reviewed changes

fix

fe45289

zasdfgbnm approved these changes Nov 22, 2024

View reviewed changes

naoyam added 2 commits November 22, 2024 08:30

Merge branch 'main' into resize_indexing_cyclic_graph_war

65e5a3f

Revert tests for codegen diff check

4bf1829

naoyam changed the title ~~[WIP] (Yet another) indexing war for resize~~ (Yet another) indexing war for resize Nov 22, 2024

jacobhinkle reviewed Nov 22, 2024

View reviewed changes

Merge remote-tracking branch 'origin/main' into resize_indexing_cycli…

44c48da

…c_graph_war

naoyam added 5 commits November 23, 2024 05:03

Map cases like split iS2 by a factor of 2

c5a95ca

test fix

2862225

Don't map unless the input and output are the same

d26f9d5

Avoid mapping of, e.g., split 32 by 64 -> 1, 64

Merge branch 'more_almost_exact' into resize_indexing_cyclic_graph_war

ffe7e32

Merge branch 'main' into resize_indexing_cyclic_graph_war

b23fdd9

jacobhinkle approved these changes Nov 27, 2024

View reviewed changes

naoyam added 2 commits November 27, 2024 09:26

Revert "Revert tests for codegen diff check"

e572218

This reverts commit 4bf1829.

Fix replacement of loop indices

2d6bc21

naoyam merged commit 8546b62 into main Nov 28, 2024
36 of 37 checks passed

naoyam deleted the resize_indexing_cyclic_graph_war branch November 28, 2024 01:25

naoyam mentioned this pull request Nov 28, 2024

Fixing indexing issue with almost-exact mapping #3494

Merged

jjsjann123 mentioned this pull request Dec 3, 2024

Internal assert with Thunder CUDA Falcon 40b like #3505

Closed

naoyam mentioned this pull request Dec 3, 2024

Limit use of the indexing war for resize #3515

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(Yet another) indexing war for resize #3454

(Yet another) indexing war for resize #3454

naoyam commented Nov 20, 2024 •

edited

Loading

naoyam commented Nov 20, 2024

naoyam commented Nov 21, 2024

naoyam commented Nov 21, 2024

naoyam commented Nov 21, 2024

zasdfgbnm Nov 21, 2024

naoyam Nov 21, 2024

naoyam Nov 21, 2024

zasdfgbnm Nov 21, 2024

naoyam Nov 21, 2024

naoyam Nov 22, 2024

naoyam commented Nov 22, 2024

jacobhinkle left a comment •

edited

Loading

jacobhinkle Nov 22, 2024

naoyam Nov 22, 2024

naoyam commented Nov 22, 2024

jacobhinkle commented Nov 22, 2024

naoyam commented Nov 23, 2024

naoyam commented Nov 27, 2024

naoyam commented Nov 27, 2024

naoyam commented Nov 28, 2024

(Yet another) indexing war for resize #3454

(Yet another) indexing war for resize #3454

Conversation

naoyam commented Nov 20, 2024 • edited Loading

naoyam commented Nov 20, 2024

naoyam commented Nov 21, 2024

naoyam commented Nov 21, 2024

naoyam commented Nov 21, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

naoyam commented Nov 22, 2024

jacobhinkle left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

naoyam commented Nov 22, 2024

jacobhinkle commented Nov 22, 2024

naoyam commented Nov 23, 2024

naoyam commented Nov 27, 2024

naoyam commented Nov 27, 2024

naoyam commented Nov 28, 2024

naoyam commented Nov 20, 2024 •

edited

Loading

jacobhinkle left a comment •

edited

Loading