Invalid indexing path when resize is used with a residual path #3455

Open · naoyam opened this issue Nov 20, 2024 · 8 comments

naoyam (Collaborator) commented Nov 20, 2024

As discussed in #3072, a common pattern in RoPE results in a cyclic exact graph. For example, when a domain is split into two halves, rotated, and concatenated, and the resulting domain is then used together with the initial input domain, the initial and final domains get mapped. Combined with the resize expressions for the slice and concat, this yields a cycle in the exact graph.

This may also cause an invalid indexing path when scheduled in a certain way. For example, we could schedule a rotation pattern as illustrated below:

[Figure: consumer-based scheduling of the rotation pattern]

t0 is the input tensor with only one iter domain, i0. The math should look like:

t1 = t0[:N/2]
t2 = t0[N/2:]
t3 = cat(t2, t1)

The scheduling illustrated here is consumer-based scheduling, meaning the loop domain of the final output, t3, is propagated back to t1 and t2. They are scheduled with additional resize ops such that their loop domains match the logical domain of t3. Since all loop domains are then exactly mapped, this fusion can be freely parallelized without any synchronization. This scheduling approach is what I'm aiming for as a first version of the scheduler for RoPE.

However, if this fusion also has a residual path as shown below (i.e., t4 = t0 + t3), indexing t0 as the producer of t1 or t2 doesn't work due to the exact mapping of i0 and i4 (and i5).

[Figure: the rotation pattern with a residual path t4 = t0 + t3]

When indexing t0 for t1, its loop domain is just i4, which is mapped with i0. In the new indexing method, this simply means the loop index of i4 can be used as is for indexing t0. Since they are grouped together, there's no indexing traversal involved.

However, this is clearly incorrect. For t1, we need the left half of t0 to be placed at the right half of t1. So, the index math should look like i - N/2, where i is the loop index of i4. Similarly, for t2, it should be i + N/2.
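To make the index math concrete, here's a small PyTorch sketch (purely illustrative, not nvFuser code; the names are just stand-ins for the domains above) that checks the i - N/2 / i + N/2 arithmetic against a reference rotation:

```python
import torch

N = 8
t0 = torch.arange(N)

# The rotation: t1 = t0[:N/2], t2 = t0[N/2:], t3 = cat(t2, t1)
t3 = torch.cat([t0[N // 2:], t0[:N // 2]])

# Indexing t0 from the rotated loop domain: loop index i reads
# t0[i + N/2] in the left half (t2's slot) and t0[i - N/2] in the
# right half (t1's slot), not t0[i] as the exact mapping would suggest.
i = torch.arange(N)
src = torch.where(i < N // 2, i + N // 2, i - N // 2)
assert torch.equal(t3, t0[src])
```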

Fundamentally, this seems to indicate that i0 should not be mapped with {i4, i5, i3, i6}. However, this mapping happens automatically in our current formulation of the exact graph because of the t4 expression, t4 = t0 + t3. While these domains do have the same extent, they may need to be considered unmapped for indexing.

naoyam (Collaborator, Author) commented Nov 20, 2024

@zasdfgbnm @jacobhinkle Please let me know if this makes sense. I have a WAR (#3454), which isn't ideal at all but works for RoPE.

jacobhinkle (Collaborator) commented

It seems like this is specific to Resize. For example if we resize by padding by a on the left and -a on the right, we could use the result in a binaryOp with the original input and wind up with the same situation.
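For concreteness, here's a minimal PyTorch sketch of that situation (just an illustration, not nvFuser code; it assumes `F.pad` accepts a negative pad amount, which trims elements):

```python
import torch
import torch.nn.functional as F

a, N = 3, 8
x = torch.arange(N, dtype=torch.float32)

# Pad by a on the left and -a on the right: the extent is unchanged,
# but element i of y comes from x[i - a] (zero where out of range).
y = F.pad(x, (a, -a))
assert y.shape == x.shape

# Residual-style use with the original input: the extents match, so the
# two domains would be exact mapped, yet their indices differ by a.
z = x + y
```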

Is this the only pattern we know of that displays this issue?

Maybe it's enough to just not exact map if the two groups already have a Resize ExprGroup between them? This would mean we'd have to have already processed all the Resize Exprs before building the exact graph. If we did that then the Exact graph would become finer than it is today; I'm not sure whether that would cause problems or not but I can imagine we might assume that Iteration input IDs to BinaryOp and TernaryOp are exact mapped in some places.

naoyam (Collaborator, Author) commented Nov 20, 2024

> It seems like this is specific to Resize. For example if we resize by padding by a on the left and -a on the right, we could use the result in a binaryOp with the original input and wind up with the same situation.

I think this indexing issue is specific to Resize, or more specifically to the rotation. We map the rotated domain with its input; they of course have the same extent, but for the sake of indexing they should not be considered the same.

As discussed in #3072, reshape can also result in a cyclic graph, but I don't think it would cause an indexing problem like this.

> Is this the only pattern we know of that displays this issue?

Yes.

> Maybe it's enough to just not exact map if the two groups already have a Resize ExprGroup between them? This would mean we'd have to have already processed all the Resize Exprs before building the exact graph. If we did that then the Exact graph would become finer than it is today; I'm not sure whether that would cause problems or not but I can imagine we might assume that Iteration input IDs to BinaryOp and TernaryOp are exact mapped in some places.

Are you saying we shouldn't map i0 and i6? Would that mean i3 should also not be mapped with i6? I'm pretty sure that would cause a different problem with inlining and sync analysis since the exact graph is currently used as the basis of all the other graphs.

jacobhinkle (Collaborator) commented

You're right, the simple approach is not sufficient.

Maybe we need another type of ExprGroup for this case that indicates multiple ValGroups are aligned in an expression such that they would normally be Exact mapped but cannot be, for example because, as in this case, they would introduce cycles. That would let us take different paths to the ValGroups of i3 and i0 for indexing. For determining whether they have the same extents, we can derive an "Extents" graph by mapping all the input and output ValGroups of these "Align" ExprGroups. That Extents graph might be a substitute for Exact that would work for inlining and other analyses, but it could have cycles.
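A toy sketch of that idea (illustrative Python only, not nvFuser code; the ID names follow the figures above): keep an Exact union-find that skips the pairs recorded as "Align", plus a coarser Extents union-find that merges them too:

```python
class UnionFind:
    """Minimal union-find to model ValGroup merging."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

exact, extents = UnionFind(), UnionFind()
exact_maps = [("i4", "i3"), ("i5", "i3"), ("i3", "i6")]
align_maps = [("i0", "i6")]  # t4 = t0 + t3: same extent, but Align only

for a, b in exact_maps:
    exact.union(a, b)
    extents.union(a, b)
for a, b in align_maps:
    extents.union(a, b)  # merged for extents, NOT in the Exact graph

# i0 stays in its own Exact group, so indexing must traverse the resizes,
# while the Extents graph still records that i0 and i3 have equal extents.
assert exact.find("i0") != exact.find("i3")
assert extents.find("i0") == extents.find("i3")
```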

zasdfgbnm (Collaborator) commented

> Fundamentally, this seems to indicate that i0 should not be mapped with {i4, i5, i3, i6}.

Why is it "i0 should not be mapped with {i4, i5, i3, i6}" rather than one of the following?

  • {i0, i6} should not be mapped with {i4, i5, i3}
  • {i0, i6, i3} should not be mapped with {i4, i5}

When we are indexing t0 as a direct producer of t4, we do want i0 to be mapped to i6. Depending on the task we are doing, we may or may not want things to be mapped.

I think the fundamental problem is not whether i0 should be mapped with i6 or not. The fundamental problem is that we should not traverse the exact graph for indexing. Indexing needs a real index graph, where two items are mapped if and only if they have the same index, as originally planned by Christian. This means that a tensor's IDs as a consumer and as a producer of different ops are not mapped, because they have different indices.

naoyam (Collaborator, Author) commented Nov 21, 2024

Oh, my comment was not that precise. I agree with you.

jacobhinkle (Collaborator) commented Nov 21, 2024

> This means that a tensor's IDs as a consumer and as a producer of different ops are not mapped, because they have different indices.

Just so I understand this: you mean that this graph would contain (1 + num_uses) copies of each ID in tv->allIDs() (or some subset if possible). Then we would do mapping by looking at each TV expr and mapping the producer logical IDs' copies for that expr with the consumer copies of the corresponding output TV's maybeRoot IDs. This way the ValGraph doesn't have paths that propagate beyond one TV expression at a time, but maybe that's the point? It feels like that means we would be creating a single graph per producer/consumer TV pair.

zasdfgbnm (Collaborator) commented

> Just so I understand this: you mean that this graph would contain (1 + num_uses) copies of each ID in tv->allIDs() (or some subset if possible). Then we would do mapping by looking at each TV expr and mapping the producer logical IDs' copies for that expr with the consumer copies of the corresponding output TV's maybeRoot IDs.

I think that's pretty much it.

> It feels like that means we would be creating a single graph per producer/consumer TV pair.

Partially yes, but it is important to note that:

  1. We should still use a single ValGraph to manage all these relations. We can use (IterDomain, TV expr) as the key for ValGraph.
  2. We should map IterDomain across TV exprs based on inlining.

For example, if we have

T2[b1, I1] = set(T0[b0, I0]) ca_pos(1)
T3[I2, I3] = add(T2[b1, I1], T1[I4, I5])

Then we will have groups like:

g0: {(b0, set), (b1, set), (b1, add)}
g1: {(I0, set), (I1, set)}
g2: {(I1, add), (I5, add), (I3, add)}
g3: {(I2, add), (I4, add)}
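Here's a small runnable sketch of this keying scheme (illustrative Python, not nvFuser code): nodes are (IterDomain, TV expr) pairs, producer/consumer IDs are mapped within each expr (broadcast only with broadcast), and the only cross-expr mapping comes from the inlining at ca_pos(1). It reproduces the groups above:

```python
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Within-expr producer/consumer mappings. A broadcast ID maps only with
# another broadcast ID, so b1 is not mapped with I2 or I4.
union(("b0", "set"), ("b1", "set"))
union(("I0", "set"), ("I1", "set"))
union(("I1", "add"), ("I5", "add"))
union(("I1", "add"), ("I3", "add"))
union(("I2", "add"), ("I4", "add"))

# Cross-expr mapping from inlining: T2 is computed-at position 1, so only
# its first ID (b1) is shared between its definition (set) and use (add).
union(("b1", "set"), ("b1", "add"))

groups = {}
for key in list(parent):
    groups.setdefault(find(key), []).append(key)
print(list(groups.values()))  # -> the four groups g0..g3 listed above
```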

naoyam added a commit that referenced this issue Nov 28, 2024
This is a WAR for #3455. The exact graph-based indexing doesn't work because of the mapping introduced by the residual path. I think we should investigate what the right graph should look like for indexing, but to unblock the scheduler for RoPE, this PR works around the issue by creating a local graph that only includes the tensors involved in the expression being indexed, thus removing the effect of the residual path.

`IndexingTraversal::getExprsBetweenForResize` is the main addition, which creates a new IdModel consisting only of the tensors of a given expr. If a resize is used in any of the producers and consumers of the expr, we use the path found by the local model. Currently, if it fails to find a path, it's considered an error.

While this WAR works for the prototype scheduler for RoPE so far (#3425), it does have some issues as well. For example, the local IdModel doesn't have all the information necessary to identify loop promotions, but the loop domain of the expr may be promoted, so the local model may not be able to find the corresponding IDs. In other words, if resize is used with inlined broadcast IDs, `getExprsBetweenForResize` may fail to find a path, and it would then fall back to the existing path, which may not be correct in the case of #3455. However, this can be avoided by scheduling the loop domains such that no promotion analysis is required. We can now do this by using things like `TensorDomain::broadcast()` and `scheduler_tools::scheduleLoopDomainsLike()`, so I don't think this issue is a blocker.

The overall changes are also due to the change of the interface of `IndexingTraversal::getExprsBetween`, which now takes `std::vector<IterDomain*>` instead of `ValGroups`, since the former is required for the local IdModel.
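To illustrate why the local graph sidesteps the problem (again a toy Python sketch, not the actual implementation; ID names follow the example at the top of the issue): with only t0 and t1's IDs present, i0 and i4 end up in different groups, so the indexing traversal has to walk the resize exprs and picks up the i - N/2 offset:

```python
def build_find(mappings):
    """Build a union-find from (a, b) exact mappings; return its find()."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in mappings:
        parent[find(a)] = find(b)
    return find

# Global exact graph: the residual expr t4 = t0 + t3 maps i0 with i6,
# which is mapped with i3 and hence with the loop domains i4 and i5.
global_find = build_find(
    [("i4", "i3"), ("i5", "i3"), ("i3", "i6"), ("i0", "i6")])
assert global_find("i0") == global_find("i4")  # no traversal -> wrong index

# Local graph for indexing t0 as the producer of t1: only t0 and t1's IDs
# exist, so none of the add-expr mappings apply and i0 != i4. The path
# i4 -> (pad) -> i1 -> (slice) -> i0 must be traversed, yielding i - N/2.
local_find = build_find([])
assert local_find("i0") != local_find("i4")
```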
naoyam added a commit that referenced this issue Dec 6, 2024
Extracted from #3425 

A simple scheduling utility that replays a single transform expr on the loop domain of a given tensor. It is conceptually similar to the existing transform propagator, but this interface can replay a transform expr not only as a forward expr but also as a backward expr.

The existing `scheduler_tools::scheduleLoopDomainsLike` can also be used to propagate loop domains across tensors, but I encountered a couple of issues due to the resize mapping [issue](#3455). This `scheduleLoopDomainsBy` gives more explicit control for setting loop domains with resize ops.
naoyam added a commit that referenced this issue Dec 10, 2024
Stacked on top of #3549. This is also a WAR for #3455 and necessary to
schedule RoPE-like rotation patterns.

Because of the issue, a tensor may have two IDs that are exactly mapped.
For example, when an ID is sliced to half and then is padded back to the
same size, and the final output ID is used with the initial input ID,
the initial input and the final output IDs get mapped together. This can
make it difficult to use `scheduleLoopDomainsLike`. For example, if a
reference has a split that is done with the final output ID, and we want
to replay the split on other tensors, it becomes ambiguous whether the
split is done with the initial input or the final output since both are
exactly mapped.

To avoid this ambiguity, this PR adds a flag to indicate that we just
want to update the current loop domain with a reference domain. As seen
in the added tests, this flag is used to propagate the scheduling of a
reference tensor once all resize ops are propagated to inputs.
Specifically, the overall scheduling follows this pattern:

1. Propagate all slice and pad ops to fusion inputs
2. Pick and schedule a reference tensor
3. Propagate the scheduling of the reference tensor to the other tensors

`scheduleLoopDomainsLike` with the flag is used at step 3. For that
step, we know that we don't need to schedule each tensor with a complex
replay path, like some backward ops followed by some other forward ops,
but we just need to update the current loop domain by replaying the diff
with the reference domain.