Allocation changes for TMA Circular Buffering #2824
Conversation
// Keep track of placeholders for tokens returned by arrive/expected tx
// mbarrier operations for each load/store operation that requires such
// synchronization
std::unordered_map<const Expr*, TensorView*> ldst_mbarrier_token_map_;

// Collection of kir::Allocate for smem buffers used for mbarrier and token
// objects from cpAsyncBulk synchronization
std::unordered_set<const Expr*> mbarrier_token_smem_alloc_set_;

// Keep track of which mbarrier object is used in a load/store operation that
// requires such synchronization, required by the indexing pass
std::unordered_map<const Expr*, kir::TensorIndex*> ldst_mbarrier_index_map_;
Should we put them together as a struct `MBarrierInfo`?
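For illustration, the suggested grouping might look roughly like the sketch below. This is hypothetical; the field names, and which members fold into the struct, are assumptions rather than anything proposed in the PR.

```cpp
// Hypothetical sketch: group the per-LoadStoreOp bookkeeping into one struct
// keyed by the expression. The smem allocation set is not folded in here,
// since it holds kir::Allocate expressions rather than per-LoadStoreOp data.
struct MBarrierInfo {
  // Placeholder TensorView for the token returned by arrive/expected tx.
  TensorView* token = nullptr;
  // Indexed mbarrier used by the cloned load/store, filled by the indexing pass.
  kir::TensorIndex* indexed_mbarrier = nullptr;
};

// One entry per load/store expression that needs mbarrier synchronization.
std::unordered_map<const Expr*, MBarrierInfo> ldst_mbarrier_info_map_;
```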
Could you add this as a comment in the code?
.build();
mbarrier->setMemoryType(MemoryType::Shared);

// The wait condition for mbarrier is a single thread and the expected
Instead of assuming there will be only a single thread issuing the TMA load, should we check the parallelization of expr->output(0) to determine that value?
The number of threads needs to be a constant when we call arrive_expected_tx in the CUDA kernel, but the actual number of threads can be a runtime value, so defaulting to a single thread is the simplest thing.
I'm using a regular `block_sync` for synchronizing the CTA rather than the mbarrier because of this compile-time constraint.
I still don't understand. Could you elaborate more on:

> The number of threads needs to be a constant when we call arrive_expected_tx in the CUDA kernel.
> The number of threads needs to be a constant when we call arrive_expected_tx in the CUDA kernel.

I was wrong about this ^^^.
I think your proposal is this:
kir::MBarrierInit* mbarrier_init = IrBuilder::create<kir::MBarrierInit>(
    mbarrier,
    simplifyExpr(SimplifyingIrBuilder::maybeCastExpr(
        DataType::UInt32,
        lower_utils::getNumThreadsInTensorView(
            expr->output(0)->as<TensorView>()))));
I'm fine with this change.
LGTM. I defer to @zasdfgbnm.
!build
Stamping. I don't think there is anything in this PR that cannot be resolved in a separate PR.

!build
## Summary ##
These are the changes to the indexing lowering pass from #2773. It is stacked on #2824.

Tracking Branch: #2773

## Details ##
- In the circular buffering pass, we manually index the mbarriers and tokens using the index of the prologue, main, and epilogue loops (a worked example of this stage arithmetic follows this list).
```cpp
for (int index : c10::irange(fl->extent())) {
  int stage = index % number_of_pipeline_stages;
  mbarrier_t current_stage_mbarrier = mbarriers[stage]; // represented with kir::TensorIndex
  int next_stage = (index + number_of_pipeline_stages - 1) % number_of_pipeline_stages;
  mbarrier_t next_stage_mbarrier = mbarriers[next_stage]; // represented with kir::TensorIndex
}
```
- The handle functions for `kir::MBarrierInit`, `kir::MBarrierInvalidate`, `kir::MBarrierArriveExpectTx`, and `kir::MBarrierWait` are modified to handle `kir::TensorIndex`.
- `u32IndexScalarSmemTv` is modified to get the shared memory pointer address for a `kir::TensorIndex`.
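To make the stage rotation concrete, here is a small standalone check of the arithmetic above; the stage count of 3 and the loop bound are arbitrary illustrative choices, not values from the PR.

```cpp
// Worked example of the stage arithmetic used in the circular buffering pass.
// With 3 pipeline stages, the current-stage mbarrier and the next-stage
// mbarrier rotate through the available stages as the loop index advances.
#include <cstdio>

int main() {
  constexpr int number_of_pipeline_stages = 3;
  for (int index = 0; index < 6; ++index) {
    int stage = index % number_of_pipeline_stages;
    int next_stage =
        (index + number_of_pipeline_stages - 1) % number_of_pipeline_stages;
    std::printf("index %d -> current stage %d, next stage %d\n",
                index, stage, next_stage);
  }
  return 0;
}
```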
## Summary ##
This PR adds support for TMA circular buffering. It is stacked on #2824 and #2825.

Tracking branch: #2773

## Description ##
- The Pre-Prologue and Post-Epilogue loops are created in the allocation pass.
- The Pre-Prologue loop allocates shared memory and initializes mbarriers, while the Post-Epilogue loop invalidates mbarriers.
- In the circular buffer pass, `CloneTmaCircularBufferLoopAndInsertSync` clones operations and inserts mbarrier synchronization logic to create the prologue, main, and epilogue for-loops.
- The prologue copies only the load operations. `arriveExpectTx` and `arrive` expressions are created for cpAsyncBulk load operations.
- The main loop copies the load and computation operations, adds `arriveExpectTx` and `arrive` for the next stage, and calls `mbarrierWait` for the current stage.
- The epilogue copies only the computation operations and adds `mbarrierWait` for the remaining stages in the pipeline.

## Lowering Details ##
Description of changes in the lowering passes.
- `Prologue`, `Main`, and `Epilogue` loops are created by `CloneTmaCircularBufferLoopAndInsertSync`, which is a child class of `CircularBufferLoopCloner`.
- `PrePrologue` and `PostEpilogue` loops are created in the allocation pass.
- `cuTensorMapEncodeTiled` restricts the size of each box dimension to be `<= 256`. You will need to launch multiple load operations to load larger tiles.
- We only allocate `mbarriers` for each stage, so the `expected_transaction` byte count is multiplied by the number of TMA loads per stage.
- The for-loop cloner must account for the nested for-loop structure used to launch multiple TMA loads before adding the `mbarrier_wait` for the stage.

## Loop Structure ##
Description of the for-loop structure for circular buffering. A short worked check of the iteration counts is sketched after the Testing Setup section.

<details>
<summary>Overview Circular Buffer Structure:</summary>

### Pre-prologue loop: ###
- Allocate shared memory for mbarriers and mbarrier tokens
- Initialize mbarrier for all stages

### Prologue loop: ###
- if selected_thread:
  - Issue cp async bulks for all but last stage

### Main loop: ###
- if selected_thread:
  - Issue next cp async bulk for available stage
- All threads wait until tma operation arrives
- Copy body without
  - shared memory allocations
  - mbarrier_init exprs
  - mbarrier_inval exprs

### Epilogue loop: ###
- All threads wait until tma operation arrives
- Copy body without
  - shared memory allocations
  - issuing cp async bulk operations
  - mbarrier_init exprs
  - mbarrier_inval exprs

### Post-epilogue loop: ###
- if selected_thread:
  - Invalidate mbarrier for all stages

</details>

<details>
<summary>Detailed Pseudo-Code:</summary>

```cpp
constexpr int64_t warp_size = 32;
bool first_warp = threadIdx.x < warp_size && threadIdx.y == 0 && threadIdx.z == 0;
```

### Pre-Prologue loop: ###
```cpp
__shared__ __mbarrier_t mbarriers[stages];
__shared__ __mbarrier_token_t tokens[stages];
for (int64_t loop_index : irange(stages)) {
  if (first_warp && hopper::electSync()) {
    mbarrier_init(mbarriers[loop_index], number_of_arrival_threads);
  }
}
```

### Prologue loop: ###
```cpp
// Launch loads for the first stages-1 stages
for (int64_t loop_index : irange(stages-1)) {
  if (first_warp && hopper::electSync()) {
    tokens[loop_index] = mbarrier::arriveExpectTx(mbarriers[loop_index], expected_transaction_size);
    cpAsyncBulk(mbarriers[loop_index], ...);
  } else {
    tokens[loop_index] = mbarrier::arrive(mbarriers[loop_index]);
  }
}
```

### Main loop: ###
```cpp
// Launch load for last available stage. Wait for the current stage in pipeline.
// Repeat for extent - (stages-1) iterations
for (int64_t loop_index : irange(N-(stages-1))) {
  int64_t current_stage = loop_index % stages;
  int64_t load_stage = (loop_index + (stages - 1)) % stages;
  if (first_warp && hopper::electSync()) {
    tokens[load_stage] = mbarrier::arriveExpectTx(mbarriers[load_stage], expected_transaction_size);
    cpAsyncBulk(mbarriers[load_stage], ...);
  } else {
    tokens[load_stage] = mbarrier::arrive(mbarriers[load_stage]);
  }
  mbarrier::wait(tokens[current_stage]);

  // Clone remaining operations
}
```

### Epilogue loop: ###
```cpp
// Wait for current stage in pipeline. Repeat for remaining iterations in extent.
for (int64_t loop_index : irange(N-(stages-1), N)) {
  int64_t current_stage = loop_index % stages;
  mbarrier::wait(tokens[current_stage]);

  // Clone remaining operations
}
```

### Post-Epilogue loop: ###
```cpp
for (int64_t loop_index : irange(stages)) {
  if (first_warp && hopper::electSync()) {
    mbarrier_inval(mbarriers[loop_index]);
  }
}
```

</details>

## Testing Setup ##
- 2 to 4 pipeline stages.
- (128, 500, 1024) outer dimension.
- (128, 1024) inner dimension.

1. Single Dim including Unroll and Unswitch parallelizations.
2. Multiple Dim
3. Pointwise - One tensor is loaded with TMA circular buffering. The other tensor is loaded with Set circular buffering.
4. PointwiseCpAsync - One tensor is loaded with TMA circular buffering. The other tensor is loaded with CpAsync circular buffering. This test is currently disabled, but will be fixed by #2339.
5. Reduction
6. InnerPersistent - In this schedule, the output TensorView of the cpAsyncBulk load has a serial iterDomain to the right of the computeAt position. A for-loop will launch multiple TMA loads for each pipeline stage.
7. Matmul
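As a quick sanity check on the loop split described under Loop Structure, the sketch below tallies loads and computes for illustrative values of `N` and `stages` (both values are assumptions chosen for the example): the prologue issues `stages-1` loads, the main loop does one load plus one compute per iteration, and the epilogue finishes the remaining `stages-1` computes, so both totals come out to `N`.

```cpp
// Standalone check of the circular-buffer loop split (illustrative values only).
#include <cassert>
#include <cstdio>

int main() {
  constexpr int N = 8;       // extent of the circular-buffered loop (assumed)
  constexpr int stages = 3;  // number of pipeline stages (assumed)

  int loads = stages - 1;        // prologue: loads only
  int computes = 0;
  loads += N - (stages - 1);     // main loop: one load per iteration
  computes += N - (stages - 1);  // main loop: one compute per iteration
  computes += stages - 1;        // epilogue: computes only

  assert(loads == N && computes == N);
  std::printf("loads = %d, computes = %d\n", loads, computes);
  return 0;
}
```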
## Summary ##
These are the changes to the allocation lowering pass from #2773.

## Details ##
GpuLower
- `ldst_mbarrier_token_map_` maps a `LoadStoreOp` to its mbarrier tokens, which are represented as a `TensorView` of the number of pipeline stages.
- `mbarrier_token_smem_alloc_set_` tracks the `kir::Allocate` expressions for the mbarriers and their tokens.
- `ldst_mbarrier_index_map_` maps the cloned `LoadStoreOp` in the prologue and main loops to their indexed mbarrier.

Allocation
- Allocates the mbarriers and their tokens for each cpAsyncBulk `LoadStoreOp` expression.

AliasMemory
- `markWrite` for the mbarrier and its token when the expression is `MBarrierInit`.
- `markRead` for the mbarrier and its token when the expression is `MBarrierInvalidate`.
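To illustrate the liveness bookkeeping behind those last two bullets, here is a small self-contained sketch. It is a toy model, not nvFuser's actual AliasMemory pass; every type and name in it is made up for the example. The point is only that marking the write at `MBarrierInit` and the read at `MBarrierInvalidate` bounds the live range of the mbarrier/token shared-memory buffers.

```cpp
#include <cstdio>
#include <string>
#include <unordered_map>

// Toy expression kinds standing in for the lowered IR nodes named above.
enum class ExprKind { MBarrierInit, MBarrierInvalidate, Other };

struct Expr {
  ExprKind kind;
  std::string buffer;  // shared-memory buffer the expression touches
};

struct LiveInterval {
  int first_write = -1;  // set by the "markWrite" step at MBarrierInit
  int last_read = -1;    // set by the "markRead" step at MBarrierInvalidate
};

int main() {
  // A toy lowered-kernel trace: init at position 0, invalidate at position 5.
  const Expr exprs[] = {
      {ExprKind::MBarrierInit, "mbarrier_smem"},
      {ExprKind::Other, ""},
      {ExprKind::Other, ""},
      {ExprKind::Other, ""},
      {ExprKind::Other, ""},
      {ExprKind::MBarrierInvalidate, "mbarrier_smem"},
  };

  std::unordered_map<std::string, LiveInterval> intervals;
  int pos = 0;
  for (const Expr& e : exprs) {
    if (e.kind == ExprKind::MBarrierInit) {
      intervals[e.buffer].first_write = pos;  // markWrite
    } else if (e.kind == ExprKind::MBarrierInvalidate) {
      intervals[e.buffer].last_read = pos;    // markRead
    }
    ++pos;
  }

  // Shared memory for the buffer can only be reused outside this interval.
  for (const auto& [buf, interval] : intervals) {
    std::printf("%s is live over [%d, %d]\n",
                buf.c_str(), interval.first_write, interval.last_read);
  }
  return 0;
}
```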