Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add initial_loop_domain_ to TensorDomain #2987

Merged
merged 1 commit into from
Sep 23, 2024
Merged

Add initial_loop_domain_ to TensorDomain #2987

merged 1 commit into from
Sep 23, 2024

Conversation

naoyam
Copy link
Collaborator

@naoyam naoyam commented Sep 21, 2024

It turned out it's necessary to remember the initial loop domain, which is either the logical domain set by the constructor or the new loop domain set by setLoopDomain. When a loop domain has an extra ID, TensorDomain::allIDs may miss IDs that solely depend on the extra ID.

It turned out it's necessary to remember the initial loop domain, which
is either the logical domain set by the constructor or the new loop
domain set by setLoopDomain. When a loop domain has an extra ID,
TensorDomain::allIDs may miss IDs that solely depend on the extra ID.
@naoyam
Copy link
Collaborator Author

naoyam commented Sep 21, 2024

!build

@csarofeen
Copy link
Collaborator

When a loop domain has an extra ID

When does that happen?

@naoyam
Copy link
Collaborator Author

naoyam commented Sep 23, 2024

I'm still prototyping, but, for example:

t0 = [i0, i1] // input
t1 = [i0] // input

t2 = t0[:, n:] // slice t0 to [i0, i2] 
t3 = broadcast(t1, {false, true}) // [i0, b3]
t4 = t2 + t3 // [i0, i2]

The producer-based scheduling would use the loop domain of t0, [i0, i1], as the reference loop domain. For t1, I think it'd be the most natural to create a new broadcast ID and set its loop domain as [i0, b4]. This b4 could be also exactly mapped with b3.

@@ -3614,6 +3617,7 @@ void TensorDomain::setLoopDomain(std::vector<IterDomain*> new_loop_domain) {
". Logical: ",
toDelimitedString(logical_domain_));
loop_domain_ = std::move(new_loop_domain);
initial_loop_domain_ = loop_domain_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just append everything in loop_domain_ here to additional_ids_? In allIDs, we might want to do:

for (auto i : c10::irange(all_domains.size() - 1)) {
  for (auto j : c10::irange(all_domains.size() - 1)) {

instead of

for (auto i : c10::irange(all_domains.size() - 1)) {
  for (auto j : c10::irange(i + 1, all_domains.size())) {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did think about it, but since additional_ids_ are just "additional", it seems to me it could be useful in some cases. For example, we could cache all IDs between logical and initial_loop to speed up TensorDomain::allIDs.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

additional_ids_ was the mechanism designed to allow loop domain to have a new broadcast, and the reason why it exists was to solve almost exactly the same problem as this PR solves. Moving forward, looks like we will not continue with the "allow loop domain to have a new broadcast" direction. But I don't want to have two mechanisms for the same thing. We should pick either of them, but only one should be picked. It makes sense to speed up TensorDomain::allIDs, and it might also make sense to call the data structure we use for the speedup additional_ids_, but it will be a new thing that is totally unrelated to today's additional_ids_ except they happen to have the same name.

Copy link
Collaborator

@zasdfgbnm zasdfgbnm Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which mechanism to pick I believe is mostly a question of interface design. Today, we have:

void TensorDomain::broadcast(int64_t axis, Val* extent) {
  axis = nvfuser::wrapDim(axis, nDims() + 1);
  IterDomain* id = IterDomainBuilder(fusion()->zeroVal(), extent)
                       .iter_type(IterType::Broadcast)
                       .build();
  loop_domain_.insert(loop_domain_.begin() + axis, id);
  additional_ids_.push_back(id);
}

That creates a new broadcast on the loop domain. But if we just do setLoopDomain(something with manually created new broadcasts), we will fail.

And in this PR, we are not using an interface like TensorDomain::broadcast, so it will not have the problem of the above mechanism. But if we just do a setLoopDomain(getLoopDomain()), we will fail.

In summary, I believe neither the additional_ids_ mechanism nor the one in this PR is safe. The additional_ids_ is more convenient for the TensorDomain::broadcast-like interface, that is, we have a method function that says "add this new ID to this axis". And the one in this PR is more convenient for setLoopDomain way of inserting new IDs.

I think my question for now is: do we want to pause a bit to think about interface design, or do we want to just quickly unblock for now and do some cleanup later?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per our offline discussion, we decided that it's more important to move forward and get some progress before too much worrying about interface designs. See also #3000

@naoyam naoyam merged commit 3d82ae9 into main Sep 23, 2024
38 checks passed
@naoyam naoyam deleted the initial_loop_domain branch September 23, 2024 23:41
naoyam added a commit that referenced this pull request Sep 26, 2024
…n-conventional loop domains (#2983)

Part of #2902. See the linked design doc for the background context.

~Stacked on #2984~
Stacked on #2987 

The IdModel-based analysis is required for non-conventional loop domains
as the dependency from the logical to loop domains may not be just one
way from logical to loop. However, the IdModel-based analysis does not
support broadcast forwarding, so we can't just completely switch to the
new method.

For now, I think it makes most sense to use the existing method whenever
logical and loop domains are generated in the conventional scheduling
primitives such as split and merge. The new method is only used
otherwise.

I also considered implementing the broadcast forwarding in the
IdModel-based approach as well, but it doesn't seem to me worthwhile
doing so at this moment. I thought maybe using the Permissive graph
would make it trivial, but there seem to be (not a small number of)
subtle corner cases of mappings that may not be considered in the
Permissive graph, e.g., if the indexed domains should be mapped or not.

It should be certainly possible to extend the Permissive graph to
address these corner cases, but another factor I think we should
consider is that the broadcast forwarding itself may not be necessary
with the setLoopDomain-based approach. We should probably set the right
loop domain from the beginning instead of trying to make different loop
domains to be inlinable.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants