Limit unrolling of all circular buffered loops to depth equal to prefetch #3627
Merged
!test |
I think the unroll factor should be the prefetch stage for prologue and epilogue loops. Maybe for the main loop too.
const auto& opt = GpuLower::current()->circularBufferInfo().getCircularBufferOptionsFor(
    circular_buffer_loop_->iter_domain());
int64_t prologue_unroll = opt.prefetch;
jacobhinkle changed the title from "Limit unrolling of static circ buffered main loops" to "Limit unrolling of all circular buffered loops to depth equal to prefetch" on Dec 20, 2024
!test --diff |
Marking as draft. Apparently |
…uring lowering If there's no lowering, it means we're looking up the circ buffer options after lowering, so this is already being called on a ForLoop ID.
!test --diff |
!test --diff |
rdspring1 approved these changes on Dec 23, 2024
!test --diff |
There are other test cases where this increases spilling:
--- 02ffc838
+++ c05b6c17
@@ -117,11 +117,11 @@
#pragma unroll
for(nvfuser_index_t i55 = 0; i55 < 8; ++i55) {
((*reinterpret_cast<Array<float, 4, 1>*>(&T7[(i54 + (4 * i55))]))).set(0);
}
}
- #pragma unroll
+ #pragma unroll 3
for(nvfuser_index_t i56 = 0; i56 < 3; ++i56) {
nvfuser_index_t i57;
i57 = 32 * i56;
__half* ptr58;
ptr58 = ptr9 + i57;
@@ -167,11 +167,11 @@
}
asm volatile("cp.async.commit_group;\n");
}
asm volatile("cp.async.wait_group %0;\n"::"n"(2LL));
__syncthreads();
- #pragma unroll 4
+ #pragma unroll 3
for(nvfuser_index_t i66 = 0; i66 < i0; ++i66) {
nvfuser_index_t i67;
i67 = 32 * i66;
__half* ptr68;
ptr68 = ptr22 + i67;
This was referenced Jan 2, 2025
Currently, for dynamic shapes with circular buffered loops, we unroll the following loops to different depths:
- Prologue: `#pragma unroll`, probably due to the use of `ensureStaticIndexing` in the indexing pass, since this loop always has constant extent.
- Main loop: `#pragma unroll stages`
- Epilogue: `#pragma unroll`, similar to the prologue.

This PR unrolls each of these loops explicitly with `#pragma unroll prefetch`, where `prefetch` is the circular buffering prefetch distance, which is usually set to `stages - 1`.

Motivation
When using static shapes, as in the Fusions we receive from Thunder, I noticed that our matmul main loops are being fully unrolled (at least, a full unroll is requested, although the compiler likely does not fully unroll). For example, I have seen this:
This particular kernel took 35 seconds to compile. After this change, we will instead do the following:
and the compile time is under 400 ms with no change to kernel runtime.
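To make the unrolling policy concrete, here is a minimal sketch of how a single unroll factor could be derived from the circular buffering options and emitted for the prologue, main, and epilogue loops alike. This is not the nvFuser implementation: `CircularBufferOptions`, `defaultOptions`, and `unrollPragma` are hypothetical names used only for illustration, and the sketch assumes a prefetch distance of at least 1.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the per-loop circular buffering options
// (illustrative only; not the nvFuser type).
struct CircularBufferOptions {
  int64_t stages;    // number of buffer slots in the pipeline
  int64_t prefetch;  // how many stages are loaded ahead of the compute
};

// The usual default mentioned in the description: prefetch = stages - 1.
inline CircularBufferOptions defaultOptions(int64_t stages) {
  assert(stages >= 2);
  return CircularBufferOptions{stages, stages - 1};
}

// Build the unroll directive shared by the prologue, main, and epilogue
// loops. Assumption of this sketch: prefetch >= 1.
inline std::string unrollPragma(const CircularBufferOptions& opt) {
  assert(opt.prefetch >= 1);
  return "#pragma unroll " + std::to_string(opt.prefetch);
}
```

With `stages = 4` and the default prefetch, `unrollPragma(defaultOptions(4))` yields `#pragma unroll 3`, which matches the directive shown on both loops in the generated-code diff earlier in this conversation.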