
Resolve conflicts by recomputation #3625

Merged
naoyam merged 16 commits into main on Dec 31, 2024

Conversation

naoyam (Collaborator) commented on Dec 20, 2024

Stacked on top of #3611

This PR resolves the conflicts found by the analysis added at #3611 by recomputing slice/pad input tensors. With this, fusions like ResizeSchedulerTest.SliceRotateCatResidual can be scheduled as a single kernel by the resize scheduler.
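For illustration, a fusion in the spirit of ResizeSchedulerTest.SliceRotateCatResidual might look like the following sketch. This is nvFuser C++ test-style code, not the actual test: the makeConcreteTensor helper and the integer start/stop slice overload are assumptions used only to show the shape of the problem.

// Sketch only; assumes the nvFuser test headers (e.g. <fusion.h>, <ops/all_ops.h>)
// and namespace nvfuser.
Fusion fusion;
FusionGuard fg(&fusion);

auto tv0 = makeConcreteTensor({16, 128});
fusion.addInput(tv0);

// The same producer is consumed by two different slices...
auto tv1 = slice(tv0, {0, 0}, {16, 64});    // first half of dim 1
auto tv2 = slice(tv0, {0, 64}, {16, 128});  // second half of dim 1

// ...which are concatenated in swapped order ("rotate") and added back
// to the unsliced input (the residual path).
auto tv3 = cat({tv2, tv1}, /*dim=*/1);
auto tv4 = add(tv3, tv0);
fusion.addOutput(tv4);

Here tv0 is consumed by two slices with different resize extents as well as by the residual add, which is the kind of conflicting use the analysis from #3611 flags. Recomputing the slice input per consumer removes the conflict, so the resize scheduler can pick a single reference tensor and generate one kernel.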

Recomputation is not the only possible way to resolve conflicts. We could, for example, cache a block of an input tensor so that multiple uses of the input are served from that block. That would look more like a producer-based scheduling approach. I prototyped it here, but it didn't perform well for RoPE.

naoyam force-pushed the resize_scheduler_recomputation branch from 5d1d07e to 7380a40 on December 20, 2024
naoyam (Collaborator, Author) commented on Dec 20, 2024

!test

Base automatically changed from resize_scheduler_exclusiveness to main on December 20, 2024
naoyam (Collaborator, Author) commented on Dec 20, 2024

!test

naoyam (Collaborator, Author) commented on Dec 20, 2024

!test

@@ -133,6 +120,30 @@ bool ResizeScheduler::canScheduleCompileTime(Fusion* fusion) {
return false;
}

for (auto out_tv : ir_utils::filterByType<TensorView>(fusion->outputs())) {
naoyam (Collaborator, Author) commented:

This check is needed since the non-exclusivity check is dropped. It was redundant before.

naoyam (Collaborator, Author) commented:

Most of the changes here are mechanical, following the change to the output type of getNonExclusiveResizeInfo.

naoyam requested a review from jacobhinkle on December 20, 2024
naoyam marked this pull request as ready for review on December 20, 2024
naoyam added the rope label on Dec 20, 2024
naoyam marked this pull request as draft on December 24, 2024
naoyam (Collaborator, Author) commented on Dec 24, 2024

Found a bug. Will update soon.

naoyam (Collaborator, Author) commented on Dec 24, 2024

!test

naoyam marked this pull request as ready for review on December 24, 2024
jacobhinkle (Collaborator) left a comment:

LGTM

csrc/scheduler/resize.cpp (outdated review thread, resolved)
/*require_all_to_visited=*/false)
.first;
for (const auto& [expr_g, dir] : exprs) {
if (expr_g->front()->isA<Resize>()) {
jacobhinkle (Collaborator) commented:

Is my understanding correct that we will segment if we have one resized output and one not resized?

addInput(tv0);
tv1 = 2 * tv0;
tv2 = slice(tv1);
tv3 = 3 * tv2;
addOutput(tv2);
addOutput(tv3);

In that case we will still have a resize between tv2 and tv3, but it seems like we would potentially be able to schedule it like tv3.

naoyam (Collaborator, Author) replied:

Yes, it is definitely possible, but picking a reference tensor is a non-trivial problem. If the dimensions of tv2 are not that different from tv1, either tv2 or tv3 should be fine. However, if they are significantly different, it's unclear whether we should fuse them or not. For a trivial fusion like this, it's definitely better to fuse. For more complex fusions, we may be able to generate more efficient segmented kernels. At this point, this is not something I'm trying to address.

naoyam (Collaborator, Author) commented on Dec 31, 2024

!build

naoyam (Collaborator, Author) commented on Dec 31, 2024

!build

naoyam merged commit f9d0efa into main on Dec 31, 2024
14 of 15 checks passed
naoyam deleted the resize_scheduler_recomputation branch on December 31, 2024