Double buffering for Hopper arch #1484
Conversation
I changed the approach for adding support for Hopper TMA double buffering / circular buffering - I decided to focus on how the final kernel should look. Notes:
The generated kernel is the result of running the following test case (link). @naoyam, can I ask you to check if the prepared kernels make sense? I will start working on the infrastructure changes needed to produce code like the one below. @zasdfgbnm for visibility.
Double buffer kernel (a hedged sketch of its shape follows below)
Circular buffer - there is a notion of a circular buffer kernel as well
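A hedged sketch of the double-buffered (two-stage) TMA kernel shape this comment describes; TILE, I0, smem_buf, bar, and tokens are illustrative names, and the mbarrier helper signatures are assumptions in the style seen elsewhere in this PR, not the actual generated code:

__shared__ float smem_buf[2][TILE];
__shared__ uint64_t bar[2];     // one mbarrier per stage
__shared__ uint64_t tokens[2];  // one phase token per stage
if (threadIdx.x == 0) {
  mbarrier::init(bar[0], 1);
  mbarrier::init(bar[1], 1);
  // Prologue: issue the first TMA load into stage 0.
  tokens[0] = mbarrier::arriveExpectTX(bar[0], sizeof(float) * TILE);
  // cp.async.bulk: gmem tile 0 -> smem_buf[0], completion signaled on bar[0]
}
__syncthreads();
// Main loop: prefetch the next tile while computing on the current one.
for (int i = 0; i < I0; ++i) {
  int cur = i % 2;
  int nxt = 1 - cur;
  if (i + 1 < I0 && threadIdx.x == 0) {
    tokens[nxt] = mbarrier::arriveExpectTX(bar[nxt], sizeof(float) * TILE);
    // cp.async.bulk: gmem tile i+1 -> smem_buf[nxt], completion on bar[nxt]
  }
  mbarrier::wait(bar[cur], tokens[cur]);  // block until stage cur has arrived
  // ... compute on smem_buf[cur] ...
  __syncthreads();  // stage cur may be overwritten by the next prefetch
}
if (threadIdx.x == 0) {
  mbarrier::inval(bar[0]);
  mbarrier::inval(bar[1]);
}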
The state of the implementation, as of 02/22/2024:
@naoyam, @zasdfgbnm - can I ask you to check if the direction I'm heading in is aligned with Fuser's current and future architecture? Thanks! NOTE: for the time being, please ignore changes in
The overview of how the resulting cloned loops should look (a hedged sketch follows below):
The scope of kir changes in the prolog loop:
The scope of kir changes in the main loop:
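A minimal sketch of the cloned-loop structure described above, assuming a circular buffer of depth D with a single elected thread issuing the TMA loads; D, I0, elected, tile_bytes, bar, and tokens are illustrative names, and the mbarrier helper signatures are assumptions:

// Prologue loop: issue the first D-1 TMA loads; no compute yet.
for (int i = 0; i < D - 1; ++i) {
  if (elected) {
    tokens[i] = mbarrier::arriveExpectTX(bar[i], tile_bytes);
    // cp.async.bulk: gmem tile i -> smem stage i
  }
}
// Main loop: prefetch stage (i + D - 1), then wait on and compute stage i % D.
for (int i = 0; i < I0; ++i) {
  if (i + D - 1 < I0 && elected) {
    int ld = (i + D - 1) % D;
    tokens[ld] = mbarrier::arriveExpectTX(bar[ld], tile_bytes);
    // cp.async.bulk: gmem tile (i + D - 1) -> smem stage ld
  }
  mbarrier::wait(bar[i % D], tokens[i % D]);
  // ... compute on smem stage i % D ...
}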
It looks reasonable to me. At least, it seems like a natural extension of the current double buffering pass.
// Register life time start for a smem placeholder with tokens
// returned by MBarrierArriveExpectTx / MBarrierArrive
if (GpuLower::current()->ldstMBarrierTokenMap().count(expr)) {
  markWrite(GpuLower::current()->ldstMBarrierTokenMap()[expr]);
}
Why do we need to special-case the token's liveness? Will it just work if we use the token's first write and last read?
The mbarrier and its token are mapped together. The token is the mbarrier state of the last phase.
Token liveness doesn't seem like a strong priority - it is only a single int64_t.
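For context, a hedged sketch of the token lifetime the question refers to - the first write is the arrive, the last read is the wait (stage, bar, tokens, and tile_bytes are illustrative names; the helper signatures are assumptions):

// First write of the token: record the mbarrier phase at arrival.
tokens[stage] = mbarrier::arriveExpectTX(bar[stage], tile_bytes);
// ... cp.async.bulk writes smem and arrives on bar[stage] ...
// Last read of the token: wait for the recorded phase to complete.
mbarrier::wait(bar[stage], tokens[stage]);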
// Find the largest circular buffer depth, to be used for bulk load
// allocation
if (out_tv->isDoubleBuffered()) {
  if (out_tv->isCircularBuffered()) {
    circular_buffer_depth =
        std::max(circular_buffer_depth, out_tv->circularBufferDepth());
  } else {
    circular_buffer_depth = std::max(circular_buffer_depth, 2u);
  }
}
Is it supported to have multiple outputs with different circular buffer depths? If not, should we add an NVF_ERROR checking that?
Now that you have pointed this out, it really makes no sense to pick the largest stage. In fact, each circular buffer should be considered separately (with its own set of mbarriers/tokens). I will update this part of the code.
Thanks!
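A minimal sketch of the per-buffer bookkeeping this reply suggests, reusing the TensorView query methods from the snippet above; depth_per_tv is a hypothetical name:

// Track a depth (and, later, an mbarrier/token set) per circular-buffered
// tensor instead of a single global maximum.
std::unordered_map<TensorView*, unsigned> depth_per_tv;
if (out_tv->isDoubleBuffered()) {
  depth_per_tv[out_tv] = out_tv->isCircularBuffered()
      ? out_tv->circularBufferDepth()
      : 2u;  // plain double buffering has two stages
}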
registerInsertBefore(expr, mbarrier_tokens_alloc, expr_scope);
registerInsertBefore(expr, mbarrier_alloc, expr_scope);
registerInsertBefore(expr, mbarrier_init, expr_scope);
registerInsertAfter(expr, mbarrier_inval, expr_scope);
So we are doing the init and inval inside the loop of cp.async, i.e. generating code like:

for i in range(I0):
    __shared__ tokens;
    __shared__ mbarrier;
    init(mbarrier);
    cp.async.bulk(data, mbarrier);
    inval(mbarrier);

and leaving it for the double buffering pass to fix? Could you leave a comment in the code for this?
Yes, that's the current implementation - I will update the code here so it is clear that the Double Buffering pass modifies this part of the code. Having it done here has the benefit of easier control over shared memory allocations / storage liveness (the order of passes is insertAllocations -> reuseMemoryAllocations -> DoubleBufferPass).
Switching to an approach where all needed elements are added in the double buffering pass (we detect a LoadStoreOp for a bulk copy -> we add placeholders for tokens and mbarriers, the mbarrier init, and later the mbarrier inval) would have major consequences: adding smem allocations (mbarriers / tokens) and lifetime analysis to the system out of order (pass-wise) might be a challenge.
I will think about how we could do this in a clean way.
Thank you for the comment!
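For illustration, a hedged sketch of what the double buffering pass is expected to produce from the per-iteration pattern quoted above - allocations and init hoisted before the loop, inval moved after it (STAGES, bar, and the helper names are illustrative assumptions):

__shared__ uint64_t tokens[STAGES];
__shared__ uint64_t bar[STAGES];
if (threadIdx.x == 0) {
  for (int s = 0; s < STAGES; ++s) {
    mbarrier::init(bar[s], 1);  // hoisted out of the loop
  }
}
__syncthreads();
for (int i = 0; i < I0; ++i) {
  // cp.async.bulk(data_i, bar[i % STAGES]);
}
__syncthreads();
if (threadIdx.x == 0) {
  for (int s = 0; s < STAGES; ++s) {
    mbarrier::inval(bar[s]);  // hoisted out of the loop
  }
}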
//     mbarrier::init(...);
//   }
// }
class CpAsyncBulkPrePrologue : public kir::IrVisitor {
Should we add a DoubleBufferLoopStage::PrePrologue and a DoubleBufferLoopStage::PostEpilogue, and move the code below to DoubleBufferLoopCloner?
That's a good point.
For now I have it as a separate set of functional classes - the main reason is that DoubleBufferLoopCloner returns a kir::ForLoop, while the pre-prologue/post-epilogue are a simple collection (a list) of expressions. But you are right, we could do this in a smarter way, where these two new 'regions' are handled alongside prolog/main/epilogue. I will try to unify this part of the code, or at least leave a comment that the architecture of this part of the system should be reconsidered.
Thanks!
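A minimal sketch of the unification the reviewer suggests, assuming the existing stages are Prolog/Main/Epilog; the two new enumerators are the hypothetical additions:

enum class DoubleBufferLoopStage {
  PrePrologue,   // hypothetical: mbarrier allocation + init, emitted once before the prolog
  Prolog,
  Main,
  Epilog,
  PostEpilogue,  // hypothetical: mbarrier inval, emitted once after the epilog
};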
In our current main branch, all predicates of `cp.async.bulk` are skipped. They are skipped not because it should be like that, but because this is a quick, simple hack that lets us incrementally build out TMA. Currently, TMA can only be used in a `<<<1, 1>>>` kernel, and it can only be used to copy an entire tensor, not a part of one. Under this limitation, it totally makes sense to skip the predicates.

However, it no longer makes sense to skip predicate generation for TMA as we add support for non-trivial cases. For example, in #1484, an `if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0)` is manually created in the double buffering pass as a temporary solution. Also, I just started working on allowing TMA to be used in a non-`<<<1, 1>>>` kernel, where a thread predicate is clearly needed.

In this PR, I am re-enabling predicate generation for TMA. For all the code that is already in the main branch, this PR should be a no-op; I do not expect any change in the generated code for any TMA test. However, #1484 will be impacted in the sense that the `if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0)` should no longer be created manually in the double buffering pass; instead, the double buffering pass should leave the TMA op as-is, and the predicate generation pass will handle it.
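For reference, a hedged sketch of the kind of single-thread predicate the predicate generation pass would emit around a TMA load; the exact predicate form, and the stage/bar/tokens/tile_bytes names, are assumptions:

// Only one thread issues the TMA copy; the other threads observe
// completion through the mbarrier rather than executing the op.
if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0) {
  tokens[stage] = mbarrier::arriveExpectTX(bar[stage], tile_bytes);
  // cp.async.bulk: gmem tile -> smem stage, completion on bar[stage]
}
mbarrier::wait(bar[stage], tokens[stage]);  // all threads wait here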
@drzejan2 I was reviewing this PR to see how it could be adapted for pointwise and normalization kernels. Is the PR in a usable state? What functionality is not yet implemented?
Closed in lieu of #2833