Allocation changes for TMA Circular Buffering #2824

Merged: 3 commits merged into main on Sep 4, 2024

Conversation

@rdspring1 (Collaborator) commented Aug 21, 2024

Summary

This PR contains the changes to the allocation lowering pass from #2773.

Details

GpuLower

  • ldst_mbarrier_token_map_ maps each LoadStoreOp to its mbarrier tokens, which are represented as a TensorView sized to the number of pipeline stages.
  • mbarrier_token_smem_alloc_set_ tracks the kir::Allocate expressions for the mbarriers and their tokens.
  • ldst_mbarrier_index_map_ maps the cloned LoadStoreOp in the prologue and main loops to their indexed mbarrier.

Allocation

  • In the allocation pass, create the shared memory allocations and mbarrier operations around each LoadStoreOp expression, as shown below.
// Created tokens, mbarriers, init, and inval operations in allocation pass.
for (circular_buffer_loop) {
    __shared__ int64_t tokens[num_stages];
    __shared__ int64_t mbarrier[num_stages];
    init(mbarrier);
    cp.async.bulk(data, mbarrier);
    inval(mbarrier);
}
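For reference, a minimal sketch of how the pass might build these nodes. This is a sketch only: the TensorViewBuilder/IrBuilder calls mirror the snippets quoted later in this thread, and the exact shape, dtype, and constructor arguments are assumptions.

```cpp
// Sketch: one mbarrier (and one token) per pipeline stage, placed in shared
// memory, bracketed by init/inval expressions around the circular-buffer loop.
TensorView* mbarrier = TensorViewBuilder()
                           .shape(std::vector<int64_t>{num_stages})
                           .dtype(DataType::UInt64)
                           .build();
mbarrier->setMemoryType(MemoryType::Shared);

// Initialize before the circular-buffer loop, invalidate after it.
// `one_thread` is a placeholder for the arrival-thread count discussed below.
auto* mbarrier_init = IrBuilder::create<kir::MBarrierInit>(
    mbarrier, /*number_of_arrival_threads=*/one_thread);
auto* mbarrier_inval = IrBuilder::create<kir::MBarrierInvalidate>(mbarrier);
```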

AliasMemory

  • The mbarrier and its token are mapped together. The token is the mbarrier state of the last phase. For simplicity, mark the token's liveness when the mbarrier is initialized and invalidated.
  • Apply markWrite for the mbarrier and its token when the expression is MBarrierInit.
  • Apply markRead for the mbarrier and its token when the expression is MBarrierInvalidate (see the sketch below).
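A rough sketch of what that marking could look like inside the pass. The handler dispatch and accessor names here are assumptions; markWrite, markRead, and the token mapping come from this PR.

```cpp
// Sketch: treat the token as live over the same interval as its mbarrier.
void handle(kir::MBarrierInit* init) {
  // Liveness of the mbarrier/token pair starts at initialization.
  markWrite(init->mbarrier());
  markWrite(tokenFor(init->mbarrier()));  // tokenFor(): hypothetical token lookup
}

void handle(kir::MBarrierInvalidate* inval) {
  // Liveness of the mbarrier/token pair ends at invalidation.
  markRead(inval->mbarrier());
  markRead(tokenFor(inval->mbarrier()));  // tokenFor(): hypothetical token lookup
}
```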

Comment on lines +388 to +399
// Keep track of placeholders for tokens returned by arrive/expected tx
// mbarrier operations for each load/store operation that requires such
// synchronization
std::unordered_map<const Expr*, TensorView*> ldst_mbarrier_token_map_;

// Collection of kir::Allocate for smem buffers used for mbarrier and token
// objects from cpAsyncBulk synchronization
std::unordered_set<const Expr*> mbarrier_token_smem_alloc_set_;

// Keep track what mbarrier object is used in load/store operation that
// requires such synchronization, required by indexing pass
std::unordered_map<const Expr*, kir::TensorIndex*> ldst_mbarrier_index_map_;
Collaborator:

Should we put them together as a struct MBarrierInfo?
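A minimal sketch of what such a grouping could look like, purely for illustration; the field choices are assumptions based on the existing members above.

```cpp
struct MBarrierInfo {
  // Placeholder token returned by arrive / arriveExpectTx for this load/store.
  TensorView* token = nullptr;
  // Indexed mbarrier used by the cloned load/store in the prologue/main loops.
  kir::TensorIndex* indexed_mbarrier = nullptr;
};
// e.g. std::unordered_map<const Expr*, MBarrierInfo> ldst_mbarrier_info_map_;
```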

Collaborator (Author):

I'll make this change in a separate PR after #2824 and #2825 but before #2833.

@zasdfgbnm (Collaborator):

> • In the allocation pass, create the shared memory allocations and mbarrier operations around each LoadStoreOp expression.
>
> // Created tokens, mbarriers, init, and inval operations in allocation pass.
> for (circular_buffer_loop) {
>     __shared__ tokens[num_stages];
>     __shared__ mbarrier[num_stages];
>     init(mbarrier);
>     cp.async.bulk(data, mbarrier);
>     inval(mbarrier);
> }

Could you add this as a comment in the code?

Comment on csrc/device_lower/pass/allocation.cpp (outdated, resolved)
.build();
mbarrier->setMemoryType(MemoryType::Shared);

// The wait condition for mbarrier is a single thread and the expected
Collaborator:

Instead of assuming there will be only a single thread issuing the TMA load, should we check the parallelization of expr->output(0) to determine that value?

Collaborator (Author):

The number of threads needs to be a constant when we call arrive_expected_tx in the CUDA kernel. The number of threads can be a runtime value, so defaulting to a single thread is the simplest option.

Collaborator (Author):

I'm using a regular block_sync for synchronizing the CTA rather than the mbarrier because of this compile-time constraint.

Collaborator:

I still don't understand. Could you elaborate on this:

> The number of threads needs to be a constant when we call arrive_expected_tx in the CUDA kernel.

Collaborator (Author):

> The number of threads needs to be a constant when we call arrive_expected_tx in the CUDA kernel.

I was wrong about this ^^^.

I think your proposal is this:

kir::MBarrierInit* mbarrier_init = IrBuilder::create<kir::MBarrierInit>(
    mbarrier,
    simplifyExpr(SimplifyingIrBuilder::maybeCastExpr(
        DataType::UInt32,
        lower_utils::getNumThreadsInTensorView(
            expr->output(0)->as<TensorView>()))));

I'm fine with this change.

@naoyam (Collaborator) left a comment:

LGTM. I defer to @zasdfgbnm.

@rdspring1 (Collaborator, Author):

!build

@zasdfgbnm (Collaborator) left a comment:

Stamping. I don't think there is anything in this PR that cannot be resolved in a separate PR.

@rdspring1 (Collaborator, Author):

!build

@rdspring1 rdspring1 merged commit bcdd181 into main Sep 4, 2024
36 checks passed
@rdspring1 rdspring1 deleted the tma_cb_alloc branch September 4, 2024 16:42
rdspring1 added a commit that referenced this pull request Sep 5, 2024
## Summary ##
This PR contains the changes to the indexing lowering pass from
#2773. It is stacked on #2824.
Tracking Branch: #2773

## Details ##
- In the circular buffering pass, we manually index the mbarriers and
tokens using the index of the prologue, main, and epilogue loops.
```cpp
for (int index : c10::irange(fl->extent())) {
  int stage = index % number_of_pipeline_stages;
  mbarrier_t current_stage_mbarrier = mbarriers[stage];  // represented with kir::TensorIndex

  int next_stage = (index + number_of_pipeline_stages - 1) % number_of_pipeline_stages;
  mbarrier_t next_stage_mbarrier = mbarriers[next_stage];  // represented with kir::TensorIndex
}
```
- The handle functions for `kir::MBarrierInit`,
`kir::MBarrierInvalidate`, `kir::MBarrierArriveExpectTx`, and
`kir::MBarrierWait` are modified to handle `kir::TensorIndex`.
- `u32IndexScalarSmemTv` is modified to get the shared memory pointer
address for a `kir::TensorIndex`.
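For context, a hedged sketch of the lowered form this enables: mbarrier PTX instructions take a 32-bit shared-memory address, so the generic pointer of the indexed mbarrier element is converted before use. The helper names below are illustrative, not nvFuser's actual runtime functions.

```cpp
#include <cstdint>

// Convert a generic shared-memory pointer to the 32-bit address expected by
// mbarrier PTX instructions.
__device__ inline uint32_t toSmemU32(const void* smem_ptr) {
  return static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
}

// Initialize the mbarrier for a given pipeline stage (mbarriers[stage] plays
// the role of the kir::TensorIndex produced by the indexing pass).
__device__ inline void initStageMbarrier(
    uint64_t* mbarriers, int64_t stage, uint32_t thread_count) {
  uint32_t addr = toSmemU32(&mbarriers[stage]);
  asm volatile("mbarrier.init.shared.b64 [%0], %1;" ::"r"(addr), "r"(thread_count));
}
```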
rdspring1 added a commit that referenced this pull request Sep 29, 2024
## Summary ##
This PR adds support for TMA circular buffering. It is stacked on #2824
and #2825.
Tracking branch: #2773

## Description ##
- The Pre-Prologue and Post-Epilogue loops are created in the allocation
pass.
- The Pre-Prologue loop allocates shared memory and initializes mbarriers,
while the Post-Epilogue loop invalidates them.
- In the circular buffer pass, `CloneTmaCircularBufferLoopAndInsertSync`
clones operations and inserts mbarrier synchronization logic to create
the prologue, main, and epilogue for-loops.
- The prologue copies only the load operations. `arriveExpectTx` and
`arrive` expressions are created for cpAsyncBulk load operations.
- The main loop copies the load and computation operations, adds
`arriveExpectTx` and `arrive` for the next stage, and calls `mbarrierWait`
for the current stage.
- The epilogue copies only the computation operations and adds
`mbarrierWait` for the remaining stages in the pipeline.


## Lowering Details ##
Description of changes in lowering passes.
- `Prologue`, `Main`, and `Epilogue` loops are created by
`CloneTmaCircularBufferLoopAndInsertSync` which is a child class of
`CircularBufferLoopCloner`.
- `PrePrologue` and `PostEpilogue` loops are created in the allocation
pass.
- The `cuTensorMapEncodeTiled` API restricts the size of each box dimension
to be `<= 256`, so multiple load operations are needed to load larger
tiles.
- We allocate one `mbarrier` per stage rather than per TMA load, so the
`expected_transaction` byte count is multiplied by the number of TMA loads
per stage (see the sketch after this list).
- The for-loop cloner must account for the nested for-loop structure
used to launch multiple TMA loads before adding the `mbarrier_wait` for
the stage.
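A minimal sketch of that expected-transaction computation; the variable and function names are illustrative, not nvFuser's.

```cpp
#include <cstdint>

// With one mbarrier per stage, the byte count passed to arriveExpectTx must
// cover every TMA load issued for that stage.
int64_t expectedTransactionBytes(
    int64_t bytes_per_tma_load,    // bytes moved by a single cp.async.bulk
    int64_t tma_loads_per_stage) { // > 1 when the tile exceeds the 256-element box limit
  return bytes_per_tma_load * tma_loads_per_stage;
}
```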


## Loop Structure ##
Description of for-loop structure for circular buffering.

<details>
<summary>Overview of the Circular Buffer Structure:</summary>

### Pre-prologue loop: ###
- Allocate shared memory for mbarriers and mbarrier tokens
- Initialize mbarrier for all stages

### Prologue loop: ###
- if selected_thread:
  - Issue cp async bulks for all but last stage

### Main loop: ###
- if selected_thread:
  - Issue next cp async bulk for available stage
- All threads wait until tma operation arrives
- Copy body without
  - shared memory allocations
  - mbarrier_init exprs
  - mbarrier_inval exprs

### Epilogue loop: ###
- All threads wait until tma operation arrives
- Copy body without
  - shared memory allocations
  - issuing cp async bulk operations
  - mbarrier_init exprs
  - mbarrier_inval exprs

### Post-epilogue loop: ###
- if selected_thread:
  - Invalidate mbarrier for all stages
</details>

<details>
<summary>Detailed Pseudo-Code:</summary>

```cpp
constexpr int64_t warp_size = 32;
bool first_warp = threadIdx.x < warp_size && threadIdx.y == 0 && threadIdx.z == 0;
```

### Pre-Prologue loop: ###
```cpp
__shared__ __mbarrier_t mbarrier[stages];
__shared__ __mbarrier_token_t tokens[stages];
for (int64_t loop_index : irange(stages)) {
  if (first_warp && hopper::electSync()) {
    mbarrier_init(mbarrier[loop_index], number_of_arrival_threads);
  }
}
```

### Prologue loop: ###
```cpp
// Launch loads for the first stages-1 stages
for (int64_t loop_index : irange(stages-1)) {
  if (first_warp && hopper::electSync()) {
    tokens[loop_index] =
      mbarrier::arriveExpectTx(mbarrier[loop_index], expected_transaction_size);
    cpAsyncBulk(mbarrier[loop_index], ...);
  } else {
    tokens[loop_index] = mbarrier::arrive(mbarrier[loop_index]);
  }
}
```

### Main loop: ###
```cpp
// Launch load for last available stage. Wait for the current stage in pipeline.
// Repeat for extent - (stages-1) iterations
for (int64_t loop_index : irange(N-(stages-1))) {
  int64_t current_stage = loop_index % stages;
  int64_t load_stage = (loop_index + (stages - 1)) % stages;
  if (first_warp && hopper::electSync()) {
    tokens[load_stage] =
      mbarrier::arriveExpectTx(mbarrier[load_stage], expected_transaction_size);
    cpAsyncBulk(mbarrier[load_stage], ...);
  } else {
    tokens[load_stage] = mbarrier::arrive(mbarrier[load_stage]);
  }
  mbarrier::wait(tokens[current_stage]);

  // Clone remaining operations
}
```

### Epilogue loop: ###
```cpp
// Wait for current stage in pipeline. Repeat for remaining iterations in extent.
for (int64_t loop_index : irange(N-(stages-1), N)) {
  int64_t current_stage = loop_index % stages;
  mbarrier::wait(tokens[current_stage]);

  // Clone remaining operations
}
```

### Post-Epilogue loop: ###
```cpp
for (int64_t loop_index : irange(stages)) {
  if (first_warp && hopper::electSync()) {
    mbarrier_inval(mbarrier[loop_index]);
  }
}
```
</details>

## Testing Setup ##
- 2 to 4 pipeline stages.
- Outer dimension sizes: 128, 500, and 1024.
- Inner dimension sizes: 128 and 1024.

1. Single Dim including Unroll and Unswitch parallelizations.
2. Multiple Dim
3. Pointwise 
- One tensor is loaded with TMA circular buffering. The other tensor is
loaded with Set circular buffering.
4. PointwiseCpAsync 
- One tensor is loaded with TMA circular buffering. The other tensor is
loaded with CpAsync circular buffering. This test is currently disabled,
but will be fixed by #2339.
5. Reduction
6. InnerPersistent 
- In this schedule, the output TensorView of the cpAsyncBulk load has a
serial iterDomain to the right of the computeAt position. A for-loop will
launch multiple TMA loads for each pipeline stage.
7. Matmul