Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Yet another) indexing war for resize #3454

Merged
merged 15 commits into from
Nov 28, 2024
Merged

Conversation

naoyam
Copy link
Collaborator

@naoyam naoyam commented Nov 20, 2024

This is a WAR for #3455. The exact graph-based indexing doesn't work because of the mapping introduced by the residual path. I think we should investigate what the right graph should look like for indexing, but to unblock the scheduler for RoPE, this PR tries to work around the issue by creating a local graph that only includes the tensors involved in the expression to index, thus removing the effect by the residual path.

IndexngTraversal::getExprsBetweenForResize is the main addition, which creates a new IdModel just consisting of the tensors of a given expr. If a resize is used in any of the producers and consumers of the expr, we use the path found by the local model. Currently, it it fails to find a path, it's considered an error.

While this WAR works for the prototype scheduler for RoPE so far (#3425), it does have some issues as well. For example, since the local IdModel doesn't have all the information necessary to identify loop promotions, but the loop domain of the expr may be promoted, so it may not be able to find the corresponding IDs within the local model. In other words, if resize is used with inlined broadcast IDs, getExprsBetweenForResize may fail to find a path, which would then fall back to the existing path, which may not be correct in the case of #3455. However, this can be avoided by scheduling the loop domains such that no promotion analysis is required. We can now do this by using things like TensorDomain::broadcast() and scheduler_tools::scheduleLoopDomainsLike(), so I don't think this issue is a blocker.

The overall changes are also due to the change of the interface of IndexingTraversal::getExprsBetween, which now requires std::vector<IterDomain*> instead of ValGroups since for the local IdModel, the former is required.

@naoyam
Copy link
Collaborator Author

naoyam commented Nov 20, 2024

!test

@naoyam
Copy link
Collaborator Author

naoyam commented Nov 21, 2024

!test --diff

@naoyam naoyam added the rope label Nov 21, 2024
@naoyam
Copy link
Collaborator Author

naoyam commented Nov 21, 2024

!test

@naoyam naoyam marked this pull request as ready for review November 21, 2024 19:54
@naoyam
Copy link
Collaborator Author

naoyam commented Nov 21, 2024

@jacobhinkle @zasdfgbnm While the H100 tests are still blocked, could you please start reviewing this PR? I'll also do the codegen diff tests once the H100 tests are completed.

Comment on lines 2654 to 2670
// TensorIndexer needs IterDomain instead of ValGroup to work around
// the resize indexing issue
std::vector<IterDomain*> ids_to_index;
ids_to_index.reserve(groups_to_index.size());
const auto tma_all_ids = is_load ? consumer_tv->domain()->allIDs()
: producer_tv->domain()->allIDs();
for (const auto& group : groups_to_index) {
auto it = std::find_if(
tma_all_ids.begin(), tma_all_ids.end(), [&](IterDomain* gmem_id) {
return group->has(gmem_id);
});
NVF_ERROR(
it != tma_all_ids.end(),
"Cannot find corresponding ID for ",
nvfuser::toString(group));
ids_to_index.push_back(*it);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the unit tests of TMA all pass? I remembered that some ids may be newly created IDs, therefore not in the allIDs of the tv.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's I wanted to confirm with the CI tests. I'll check them manually.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, these tests failed here:

[  FAILED  ] TMAIndexingTest.DefineBoxByCompositing1
[  FAILED  ] TMAIndexingTest.DefineBoxByCompositing2
[  FAILED  ] TMAIndexingTest.NonTrivialGmemAllocationDomain2

I think we can just pick any ID from the ValGroup. The arguments to TensorIndexer doesn't need to be those of the indexed tensors. It is required for the resize WAR case (https://github.com/NVIDIA/Fuser/blob/resize_indexing_cyclic_graph_war/csrc/id_model/indexing_traversal.cpp#L81-L104), but as long as TMA is not used resize, it wouldn't fail.

This is obviously far from ideal since it would fail if both TMA and resize are used but at least that should be a hard failure rather than silently using invalid indexing paths.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a WAR, I am OK with switching to use ID, and using an arbitrary ID sounds good to me. What I am more interested is, in the future, assuming that we have a real fix for the resize problem. Would we still use ID, or will we use ValGroup?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should switch back to ValGroup, although Val may not be just IterDomain anymore, like you suggested, we may need to include Expr as a key as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed all tests pass on H100.

@naoyam naoyam changed the title [WIP] (Yet another) indexing war for resize (Yet another) indexing war for resize Nov 22, 2024
@naoyam
Copy link
Collaborator Author

naoyam commented Nov 22, 2024

!test --diff

Copy link
Collaborator

@jacobhinkle jacobhinkle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the example in #3455, I'm having a little trouble seeing how this will work. When we traverse, we first following the path from t0 to t1, t2, t3. That uses resize-based indexing, and sets the index map to reflect that. Then, when indexing t0 as producer for t4, we find a resize in the ExprPath but no Resize in t0 or t4, so getExprsBetweenForResize should now return nullopt and we instead do a ValGraph traversal which finds the shorter path and updates the index map. Is that right? If there were a further use of t3 that happened to be traversed after t4 in that example, then would it wind up using the correct (Resize-based) indexing or would it have switched to the more recent t4/t0 index?

@@ -62,7 +62,10 @@ class IdGraphIndexCompute : public OptOutDispatch {
}

void setIndex(IterDomain* id, Val* idx) {
index_map_.emplace(toGroup(id), idx);
// May overwrite index. When the graph is cyclic due to, e.g.,
// resize, the index obtained by traversing most through the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"by traversing most recently"? So the index map will reflect whichever Expr was most recently processed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It's a good question as this will have some related change in a follow-up PR 😃

@naoyam
Copy link
Collaborator Author

naoyam commented Nov 22, 2024

For the example in #3455, I'm having a little trouble seeing how this will work. When we traverse, we first following the path from t0 to t1, t2, t3. That uses resize-based indexing, and sets the index map to reflect that. Then, when indexing t0 as producer for t4, we find a resize in the ExprPath but no Resize in t0 or t4, so getExprsBetweenForResize should now return nullopt and we instead do a ValGraph traversal which finds the shorter path and updates the index map. Is that right?

I think your understanding is correct.

If there were a further use of t3 that happened to be traversed after t4 in that example, then would it wind up using the correct (Resize-based) indexing or would it have switched to the more recent t4/t0 index?

Are you thinking about a case like:

t1 = t0[:N/2]
t2 = t0[N/2:]
t3 = cat(t2, t1)
t4 = t3 + t0
t5 = t3 + 1

And your question is how to index t5? If so, since t3nort5hasResize`, it would be just done by the default indexing path.

Not appearing here, but there's a hidden tensor between t1 and t3 (and also between t2 and t3), which does have a Resize op. However, the local expr path doesn't include the Resize, it will still fall back to the default path.

Let me know if this clarifies your understanding.

@jacobhinkle
Copy link
Collaborator

Yes, that's the type of case I had in mind. I was thinking that since the ValGroup for t5's ID will be the same as that of x0, it will use the new index as you say and that that one would be incorrect. But it actually is correct, it's just that it's not the same way t0 is indexed in the Resize ops before the cat. OK, I get it now. Thanks for the explanation.

@naoyam
Copy link
Collaborator Author

naoyam commented Nov 23, 2024

!test --diff

@naoyam
Copy link
Collaborator Author

naoyam commented Nov 27, 2024

!test --diff

@naoyam
Copy link
Collaborator Author

naoyam commented Nov 27, 2024

!test

@naoyam
Copy link
Collaborator Author

naoyam commented Nov 28, 2024

@zasdfgbnm @jacobhinkle

I found an indexing bug that showed up with some of the Hopper matmul tests. The last commit of this PR fixed the issue 2d6bc21.

What happened was that updating an existing index mapping resulted in making the index replacement fail to do its job. IdGraphIndexCompute::setIndex now updates an existing mapping. This change is necessary for resize indexing, but suppose we have:

Loop domain: {i0(8)}
split i0 by 8 -> i1(1), i2(8)
Initial loop index of i0: i0_loop_idx

Since i0 and i2 are mapped in the AlmostExact graph, the assigned index of {i0, i2} remained to be i0_loop_idx. However, with this PR, it gets updated to i0_loop_idx % 8. This itself should be fine, but TensorIndexer::getIndexReplacementMap creates a replacement mapping from i0_loop_idx %8, not from i0_loop_idx. So, before this PR, the replacement map looked like:

i0_loop_idx -> i0_actual_idx

Now it looks like:

i0_loop_idx % 8 -> i0_actual_idx

This replacement map is not what we need because indices just using i0_loop_idx won't be replaced. For example, the index of i1 is i0_loop_idx / 8. Previously, this was updated to i0_actual_idx / 8, but after this PR, no replacement is done, which is wrong.

To fix the issue, the replacement map is now constructed with the initial loop indices instead of the computed indices of the loop IDs. TensorIndexer::getLoopIndex gives the initial index of a given loop ID (it needs the actual for-loops to support circular buffering).

I manually confirmed the tests pass on H100 and all other tests seem passing, so I'm going to merge this PR soon, but let me know if you have any concern.

@naoyam naoyam merged commit 8546b62 into main Nov 28, 2024
36 of 37 checks passed
@naoyam naoyam deleted the resize_indexing_cyclic_graph_war branch November 28, 2024 01:25
naoyam added a commit that referenced this pull request Nov 28, 2024
This is a follow-up to #3454. Specifically, as we now allow updating of
index mappings in `TensorIndexer::setIndex`, there can be a case like
below:

```
merge b0(1), i1(8) -> i2(8)
```

When propagating the index of `i2`, `i2_idx`, backward, the input IDs
would get `i2_idx / 8` and `i2_idx % 8`, respectively. However, if
`i2_idx` is not guaranteed to be less than 8 (for example, due to a
non-divisible split of `i2`), the broadcast `b0` id would potentially
get a non-zero index, which means that we would need to predicate `b0`
as well, i.e., `i2_idx / 8 < 1`, if it's part of the allocation domain.
However, this would not be predicated as we ignore broadcast IDs. The
new unit test would fail at the validation due to this predication
problem.

To fix the issue, we could also predicate broadcast allocation IDs.
Instead, this PR takes a simpler approach that just forwards a given
index to its almost-exactly mapped ID as is. In the above case, `b0` and
`i1` would get `0` and `i2_idx`, respectively.

Tested H100 manually.
naoyam added a commit that referenced this pull request Dec 4, 2024
Fixes this error of #3505 

```
Error from segmentation group 9:  INTERNAL ASSERT FAILED at "/Fuser/csrc/id_model/indexing_traversal.cpp":102, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Indexing path for resize not found: iblockIdx.y376{( ceilDiv(1280, blockDim.x) )}
```

The error happens when trying to use the indexing WAR for resize that
was recently added (#3454). The war itself is limited, in particular it
does not work with promoted loop IDs. The limitation should be fine for
the RoPE scheduling I've been working on, but it's a real issue in
general.

This PR avoids the issue by limiting the use of the WAR. Currently, the
WAR is used whenever there's at least a single resize expr in a single
math expr. That is actually overly pessimistic since the indexing issue
only happens when there's multiple resize exprs that result in a cycle
in the exact graph. For example, if there's only one resize, there must
be no cycle, thus the indexing WAR is not necessary.

This PR attempts to limit the use of the WAR by doing a little deeper
analysis. The added check should entirely disable the WAR for the
current default scheduling, where resize is only allowed with fusion
inputs, which means there can be no multiple dependent resize exprs in a
single fusion.

The limitation of the WAR remains, but it does not matter for RoPE, and
with this PR it should also not matter for general cases.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants