Add TMA support for circular buffering pass #2833
e223b8b to 2491171
Awesome, detailed PR description. Thank you.
* Add support for Hopper::electSync
* Create ElectSync PredicateType
* Make mbarrier synchronous
* mbarrier waits for all threads in CTA
* All threads issue arriveExpectTx to get mbarrier_token
Just some minor comments from a first pass. I haven't looked at tests yet.
Do we expect to mix [...]
The pointwise, reduction, and persistent fusions in my tests do not work with [...]
No, we do not expect that. I just want to understand what the behavior will be (will it just work? If not, are we correctly throwing errors saying that this is not supported?).
* Create `TmaCircularBufferInfo` struct to consolidate data fields for TMA circular buffering.
* Move shared memory allocations outside of circular buffering loop
* Remove GatherMBarrierAllocations
!build
LGTM! Thanks for the great work and discussion!
if (hasCpAsyncBulk) {
  insertTma(loop, it->second);
} else {
  insert(loop, it->second);
}
I tried `TEST_P(TmaCircularBufferingTest, Pointwise)` without any circular buffering, loaded TV0 with TMA and TV1 with `cp.async`, and got incorrect results.
So TV0 is not circular buffered, and TV1 is?
Should we check that all circular buffer loads have the same `LoadStoreOpType`?
!build
Summary
This PR adds support for TMA circular buffering. It is stacked on #2824 and #2825.
Tracking branch: #2773
Description
`CloneTmaCircularBufferLoopAndInsertSync` clones operations and inserts mbarrier synchronization logic to create the prologue, main, and epilogue for-loops.
* Prologue: `arriveExpectTx` and `arrive` expressions are created for cpAsyncBulk load operations.
* Main loop: [...] `arriveExpectTx` and `arrive` for next stage, and calls `mbarrierWait` for current stage.
* Epilogue: [...] `mbarrierWait` for remaining stages in the pipeline.
Lowering Details
Description of changes in lowering passes.
* `Prologue`, `Main`, and `Epilogue` loops are created by `CloneTmaCircularBufferLoopAndInsertSync`, which is a child class of `CircularBufferLoopCloner`.
* `PrePrologue` and `PostEpilogue` loops are created in the allocation pass.
* `cuTensorMapEncodeTiled` restricts the size of each box dimension to be `<= 256`. You will need to launch multiple load operations to load larger tiles.
* [...] `mbarriers` for each stage, so the `expected_transaction` bytes is multiplied by the number of TMA loads per stage.
* [...] `mbarrier_wait` for the stage.
Loop Structure
Description of for-loop structure for circular buffering.
Overview Circular Buffer Structure:
Pre-prologue loop:
Prologue loop:
Main loop:
Epilogue loop:
Post-epilogue loop:
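As a rough illustration of how the prologue, main, and epilogue loops relate, the following host-side C++ sketch enumerates the order of load and wait events for a circular buffer. It is not nvFuser code: `SyncEvent`, `circularBufferSchedule`, and the event strings are hypothetical names chosen for this example, and the sketch assumes a prefetch distance of `stages - 1` and one mbarrier per stage. The pre-prologue and post-epilogue loops are not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record of one synchronization event; not an nvFuser type.
struct SyncEvent {
  std::string loop;  // "prologue", "main", or "epilogue"
  std::string op;    // "arriveExpectTx+load" or "mbarrierWait"
  int64_t tile;      // logical iteration whose data is loaded or awaited
  int64_t stage;     // circular-buffer slot, assumed to equal the mbarrier index
};

// Enumerates, in program order, the events issued for `num_tiles` iterations
// with a circular buffer of `stages` slots.
// Assumption: the prefetch distance is stages - 1.
std::vector<SyncEvent> circularBufferSchedule(int64_t num_tiles, int64_t stages) {
  assert(num_tiles >= 0 && stages >= 1);
  const int64_t prefetch = std::min(stages - 1, num_tiles);
  std::vector<SyncEvent> events;

  // Prologue: only loads are issued, filling the first stages - 1 slots.
  for (int64_t i = 0; i < prefetch; ++i) {
    events.push_back({"prologue", "arriveExpectTx+load", i, i % stages});
  }

  // Main loop: issue the load for a later tile, then wait for the current one.
  for (int64_t i = 0; i + prefetch < num_tiles; ++i) {
    const int64_t next = i + prefetch;
    events.push_back({"main", "arriveExpectTx+load", next, next % stages});
    events.push_back({"main", "mbarrierWait", i, i % stages});
  }

  // Epilogue: no more loads; wait for the tiles still in flight.
  for (int64_t i = num_tiles - prefetch; i < num_tiles; ++i) {
    events.push_back({"epilogue", "mbarrierWait", i, i % stages});
  }
  return events;
}
```

Calling `circularBufferSchedule(5, 3)` yields two prologue loads, three main-loop load/wait pairs, and two epilogue waits, so every tile is loaded once and awaited once, with its load always preceding its wait.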
Detailed Pseudo-Code:
Pre-Prologue loop:
Prologue loop:
Main loop:
Epilogue loop:
Post-Epilogue loop:
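The lowering details note that each box dimension is limited to 256 and that the expected transaction bytes for a stage scale with the number of TMA loads in that stage. The arithmetic can be sketched as follows. This is a minimal illustration, not nvFuser's implementation: `StageLoadPlan`, `planStageLoads`, and `kMaxBoxDim` are hypothetical names, and the sketch assumes each tile extent splits evenly into equal boxes. Other TMA constraints, such as alignment, are not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Box-dimension limit stated in the PR description.
constexpr int64_t kMaxBoxDim = 256;

// Hypothetical summary of the loads needed to fill one stage.
struct StageLoadPlan {
  int64_t loads_per_stage;    // number of TMA loads issued per stage
  int64_t bytes_per_load;     // bytes transferred by a single load
  int64_t expected_tx_bytes;  // bytes the stage's mbarrier should expect
};

// Splits each tile extent into the fewest equal boxes of at most kMaxBoxDim
// elements. Assumption: every extent divides evenly by its number of boxes.
StageLoadPlan planStageLoads(const std::vector<int64_t>& tile_extents,
                             int64_t element_bytes) {
  assert(element_bytes > 0);
  int64_t loads = 1;
  int64_t box_elements = 1;
  for (int64_t extent : tile_extents) {
    assert(extent > 0);
    const int64_t splits = (extent + kMaxBoxDim - 1) / kMaxBoxDim;  // ceilDiv
    assert(extent % splits == 0);  // uneven splits are outside this sketch
    loads *= splits;
    box_elements *= extent / splits;
  }
  const int64_t bytes_per_load = box_elements * element_bytes;
  // One mbarrier tracks the whole stage, so it expects the bytes of all loads.
  return {loads, bytes_per_load, loads * bytes_per_load};
}
```

For a 512 x 64 tile of 4-byte elements, this gives two loads of 65536 bytes each, so the stage's mbarrier would expect 131072 bytes, which matches the full tile size.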
Testing Setup