-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Invalid indexing path when resize is used with a residual path #3455
Comments
@zasdfgbnm @jacobhinkle Please let me know if this makes sense. I have a WAR (#3454), which isn't ideal at all but works for RoPE. |
It seems like this is specific to Resize. For example if we resize by padding by Is this the only pattern we know of that displays this issue? Maybe it's enough to just not exact map if the two groups already have a Resize ExprGroup between them? This would mean we'd have to have already processed all the Resize Exprs before building the exact graph. If we did that then the Exact graph would become finer than it is today; I'm not sure whether that would cause problems or not but I can imagine we might assume that Iteration input IDs to BinaryOp and TernaryOp are exact mapped in some places. |
I think this indexing issue is specific to Resize, or more specifically the rotation. We map the rotated domain with its input, which of course have the same extent but for the sake of indexing they should not be considered the same. As discussed in #3072, reshape can also result in a cyclic graph, but I don't think it would cause an indexing problem like this.
Yes.
Are you saying we shouldn't map |
You're right, the simple approach is not sufficient. Maybe we need another type of ExprGroup in this case that indicates multiple ValGroups are aligned in an expression such that they would normally be Exact mapped but cannot be, for example in this case, because they would introduce cycles. That would let us take different paths to the ValGroups of i3 and i0 for indexing. For determining if they have the same extents we can derive an "Extents" graph by mapping all the input and output ValGroups of these "Align" ExprGroups. That Extents group might be a substitute for Exact that would work for inlining and other analysis but it could have cycles. |
Why it is "
When we are indexing t0 as a direct producer of t4, we do want i0 to be mapped to i6. Depending on the task we are doing, we may or may not want things to be mapped. I think the fundamental problem is not whether i0 should be mapped with i6 or not. The fundamental problem is we should not traverse the exact graph for indexing. Indexing needs a real index graph, where two items are mapped if and only if the have the same index, as originally planned by Christian. This means, a tensor as a consumer, as a producer of different ops, because they have different indices, they are not mapped. |
Oh, my comment was not that precise. I agree with you. |
Just so I understand this, you mean that this graph would contain one (1+num_uses) copies of each ID in |
I think that's pretty much it.
Partially yes, but it is important to note that:
For example, if we have
Then we will have groups like:
|
This is a WAR for #3455. The exact graph-based indexing doesn't work because of the mapping introduced by the residual path. I think we should investigate what the right graph should look like for indexing, but to unblock the scheduler for RoPE, this PR tries to work around the issue by creating a local graph that only includes the tensors involved in the expression to index, thus removing the effect by the residual path. `IndexngTraversal::getExprsBetweenForResize` is the main addition, which creates a new IdModel just consisting of the tensors of a given expr. If a resize is used in any of the producers and consumers of the expr, we use the path found by the local model. Currently, it it fails to find a path, it's considered an error. While this WAR works for the prototype scheduler for RoPE so far (#3425), it does have some issues as well. For example, since the local IdModel doesn't have all the information necessary to identify loop promotions, but the loop domain of the expr may be promoted, so it may not be able to find the corresponding IDs within the local model. In other words, if resize is used with inlined broadcast IDs, `getExprsBetweenForResize` may fail to find a path, which would then fall back to the existing path, which may not be correct in the case of #3455. However, this can be avoided by scheduling the loop domains such that no promotion analysis is required. We can now do this by using things like `TensorDomain::broadcast()` and `scheduler_tools::scheduleLoopDomainsLike()`, so I don't think this issue is a blocker. The overall changes are also due to the change of the interface of `IndexingTraversal::getExprsBetween`, which now requires `std::vector<IterDomain*>` instead of `ValGroups` since for the local IdModel, the former is required.
Extracted from #3425 A simple scheduling utility that replays a single transform expr on the loop domain of a given tensor. It is conceptually similar to the existing transform propagator, but this interface can also replay a transform expr not just as a forward expr but also as a backward expr. The existing `scheduler_tools::scheduleLoopDomainsLike` can also be used to propagate loop domains across tensors, but I encountered with a couple of issues due to the resize mapping [issue](#3455). This `scheduleLoopDomainsBy` gives more explicit control to set loop domains with resize ops.
Stacked on top of #3549. This is also a WAR for #3455 and necessary to schedule RoPE-like rotation patterns. Because of the issue, a tensor may have two IDs that are exactly mapped. For example, when an ID is sliced to half and then is padded back to the same size, and the final output ID is used with the initial input ID, the initial input and the final output IDs get mapped together. This can make it difficult to use `scheduleLoopDomainsLike`. For example, if a reference has a split that is done with the final output ID, and we want to replay the split on other tensors, it becomes ambiguous whether the split is done with the initial input or the final output since both are exactly mapped. To avoid this ambiguity, this PR adds a flag to indicate that we just want to update the current loop domain with a reference domain. As seen in the added tests, this flag is used to propagate the scheduling of a reference tensor once all resize ops are propagated to inputs. Specifically, the overall scheduling follows this pattern: 1. Propagate all slice and pad ops to fusion inputs 2. Pick and schedule a reference tensor 3. Propagate the scheduling of the reference tensor to the other tensors `scheduleLoopDomainsLike` with the flag is used at step 3. For that step, we know that we don't need to schedule each tensor with a complex replay path, like some backward ops followed by some other forward ops, but we just need to update the current loop domain by replaying the diff with the reference domain.
As discussed in #3072, a common pattern in RoPE results in a cyclic exact graph. For example, when a domain is split to two halves, rotated and concatenated, if the resulting domain is also used with the initial input domain, the initial domain and the final domain are mapped, and there are the resize expressions for the slice and concat, yielding a cycle in the exact graph.
This may also cause an invalid indexing path when scheduled in a certain way. For example, we could schedule a rotation pattern as illustrated below:
t0
is the input tensor with only one iter domain,i0
. The math should look like:The scheduling illustrated here is consumer-based scheduling, meaning the loop domain of the final output,
t3
, is propagated back tot1
andt2
. They are scheduled by additionalresize
such that their loop domains match with the logical domain oft3
. In this way, since all loop domains are exactly mapped, this fusion can be freely parallelized without any synchronization. This scheduling approach is what I'm aiming to have as a first version of the scheduler for RoPE.However, if this fusion also has a residual path like shown below (i.e.,
t4 = t0 + t3
), indexingt0
as the producer oft1
ort2
doesn't work due to the exact mapping ofi0
andi4
(andi5
).When indexing
t0
fort1
, its loop domain is justi4
, which is mapped withi0
. In the new indexing method, this simply means the loop index ofi4
can be just used as is for indexingt0
. Since they are grouped together, there's no indexing traversal involved.However, this is clearly incorrect. For
t1
, we need the left half oft0
to be placed at the right half oft1
. So, the index math should look likei - N/2
, wherei
is the loop index ofi4
. Similarly, fort2
, it should bei + N/2
.Fundamentally, this seems to indicate
i0
should not be mapped with{i4, i5, i3, i6}
. However, it's automatically done in our current formulation of the exact graph because of thet4
expression,t4 = t0 + t3
. While they do have the same extent, they may need to be considered unmapped for indexing.The text was updated successfully, but these errors were encountered: