[Tracking] TMA Circular Buffering #2773

Closed · wants to merge 155 commits

Conversation

@rdspring1 rdspring1 commented Aug 7, 2024

Summary

This PR adds support for TMA circular buffering. It is based on #1484.

Description

  • In the allocation pass, create shared memory allocations and operations around the LoadStoreOp expression.
  • In the circular buffer pass, clone operations to create the pre-prologue, prologue, main, epilogue, and post-epilogue for-loops.
  • Pre-Prologue allocates shared memory and initializes mbarriers.
  • Prologue copies only the load operations.
  • Main loop copies the load and computation operations, adds arrive_expected_tx for the next stage, and adds mbarrier_wait for the current stage.
  • Epilogue copies only the computation operations and adds mbarrier_wait for the remaining stages in the pipeline.
  • Post-Epilogue invalidates mbarriers.

Lowering Details

Description of changes in lowering passes.

  • Prologue, Main, and Epilogue loops are created by TmaCircularBufferLoopCloner, which is a child class of CircularBufferLoopCloner.
  • PrePrologue and PostEpilogue loops are created by createCpAsyncBulkFixtures.
  • The cuTensorMapEncodeTiled API restricts each box dimension to at most 256 elements, so multiple load operations must be launched to load larger tiles.
  • We allocate mbarriers only per stage, not per load, so the expected_transaction byte count is multiplied by the number of TMA loads per stage (see the sketch below).
  • The for-loop cloner must account for the nested for-loop structure used to launch multiple TMA loads before adding the mbarrier_wait for the stage.
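
The arithmetic behind the box-size limit and the per-stage expected-transaction count can be sketched as follows. This is a standalone host-side illustration with assumed tile sizes and fp32 elements, not nvFuser code:

```cpp
// Standalone sketch (assumed sizes, fp32 elements; not nvFuser code) of how the
// 256-element box limit turns one stage into several TMA loads, and how the
// per-stage mbarrier's expected-transaction bytes must account for all of them.
#include <cstdint>
#include <cstdio>

int main() {
  constexpr int64_t kMaxBoxDim = 256;   // cuTensorMapEncodeTiled per-dimension limit
  constexpr int64_t tile_rows = 128;    // assumed shared-memory tile per stage
  constexpr int64_t tile_cols = 512;    // wider than a single box allows
  constexpr int64_t elem_bytes = 4;     // fp32

  // Split the inner dimension into boxes of at most 256 elements.
  constexpr int64_t box_cols = kMaxBoxDim;
  constexpr int64_t loads_per_stage = (tile_cols + box_cols - 1) / box_cols;  // -> 2

  // Each cp.async.bulk transfers one box, but the single mbarrier for the stage
  // must expect the bytes of every load issued for that stage.
  constexpr int64_t bytes_per_load = tile_rows * box_cols * elem_bytes;
  constexpr int64_t expected_tx_bytes = bytes_per_load * loads_per_stage;

  std::printf("loads per stage: %lld, expected-transaction bytes: %lld\n",
              static_cast<long long>(loads_per_stage),
              static_cast<long long>(expected_tx_bytes));
  return 0;
}
```
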
Allocation Pass

```cpp
// Created tokens, mbarriers, init, and inval operations in the allocation pass.
for (circular_buffer_loop) {
    __shared__ tokens[num_stages];
    __shared__ mbarrier[num_stages];
    init(mbarrier);
    cp.async.bulk(data, mbarrier);
    inval(mbarrier);
}
```

Loop Structure

Description of for-loop structure for circular buffering.

Overview Circular Buffer Structure:

Pre-prologue loop:

  • Allocate shared memory for mbarriers and mbarrier tokens
  • Initialize mbarrier for all stages

Prologue loop:

  • if selected_thread:
    • Issue cp async bulks for all but last stage

Main loop:

  • if selected_thread:
    • Issue next cp async bulk for available stage
  • All threads wait until tma operation arrives
  • Copy body without
    • shared memory allocations
    • mbarrier_init exprs
    • mbarrier_inval exprs

Epilogue loop:

  • All threads wait until tma operation arrives
  • Copy body without
    • shared memory allocations
    • issuing cp async bulk operations
    • mbarrier_init exprs
    • mbarrier_inval exprs

Post-epilogue loop:

  • if selected_thread:
    • Invalidate mbarrier for all stages

Detailed Pseudo-Code:

Pre-Prologue loop:

```cpp
__shared__ __mbarrier_t barriers[num_stages];
__shared__ __mbarrier_token_t tokens[num_stages];
if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0) {
  for (int64_t loop_index : irange(stages)) {
    mbarrier_init(mbarrier[loop_index], number_of_arrival_threads);
  }
}
```

Prologue loop:

```cpp
if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0) {
  for (int64_t loop_index : irange(stages-1)) {
    tokens[loop_index] = mbarrier::arriveExpectTx(mbarrier[loop_index]);
    cpAsyncBulk(mbarriers[loop_index], ...);
  }
}
```

Main loop:

```cpp
for (int64_t loop_index : irange(N-(stages-1))) {
  current_stage = loop_index % stage_depth;
  load_stage = (loop_index + (stage_depth - 1)) % stage_depth;
  if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0) {
    tokens[load_stage] =
      mbarrier::arriveExpectTx(mbarrier[load_stage]);
    cpAsyncBulk(mbarrier[load_stage], ...);
  }
  mbarrier::wait(tokens[current_stage]);

  // Clone remaining operations
  // (the stage rotation is illustrated in the sketch after the Post-Epilogue loop)
}
```

Epilogue loop:

```cpp
for (int64_t loop_index : irange(N-(stages-1), N)) {
  current_stage = loop_index % stage_depth;
  mbarrier::wait(tokens[current_stage]);

  // Clone remaining operations
}
```

Post-Epilogue loop:

```cpp
if (threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0) {
  for (int64_t loop_index : irange(stages)) {
    mbarrier_inval(mbarrier[loop_index]);
  }
}
```
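
To make the modular stage arithmetic of the Main and Epilogue loops above concrete, here is a standalone sketch (assumed stages = 3 and serial extent N = 8; not nvFuser code) that prints which stage each iteration waits on and which stage it refills:

```cpp
// Standalone sketch: rotation of current_stage and load_stage for stages = 3.
#include <cstdio>

int main() {
  constexpr int stages = 3;
  constexpr int N = 8;
  // The prologue already issued loads for stages 0 .. stages-2.
  for (int loop_index = 0; loop_index < N - (stages - 1); ++loop_index) {
    int current_stage = loop_index % stages;              // stage this iteration waits on
    int load_stage = (loop_index + stages - 1) % stages;  // stage refilled for a future iteration
    std::printf("iteration %d: wait on stage %d, load stage %d\n",
                loop_index, current_stage, load_stage);
  }
  return 0;
}
```
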

Testing

  • 2 to 4 pipeline stages.
  • (128, 500, 1024) outer dimension.
  • (128, 1024) inner dimension.

  1. Single Dim including Unroll and Unswitch parallelizations.
  2. Multiple Dim
  3. Pointwise
  4. Reduction
  5. InnerPersistent
  6. Matmul

Future PRs

  • Replace the TDX == 0 && TDY == 0 && TDZ == 0 predicate with the Hopper::elect_sync PTX instruction (a sketch follows this list).
  • Known Issue: Some LoadStore operations add additional TDY == 0 predicates that conflict with Hopper::elect_sync.
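
A possible shape for that replacement, assuming inline PTX on sm_90 (the actual nvFuser lowering may differ): elect.sync selects one leader lane from the warp-wide member mask and sets a predicate only in that lane.

```cpp
// Hypothetical helper (assumed inline-PTX form; not the nvFuser implementation).
// Requires sm_90 and assumes blockDim.x >= 32, so the first warp is exactly
// threadIdx.x < 32 with threadIdx.y == threadIdx.z == 0.
#include <cstdint>

__device__ bool elect_one_in_first_warp() {
  if (threadIdx.x >= 32 || threadIdx.y != 0 || threadIdx.z != 0) {
    return false;  // only the first warp participates
  }
  uint32_t leader_lane = 0;
  uint32_t is_elected = 0;
  asm volatile(
      "{\n"
      "  .reg .pred P;\n"
      "  elect.sync %0|P, %2;\n"   // elect one leader lane from the member mask
      "  selp.u32 %1, 1, 0, P;\n"  // expose the predicate as a 32-bit value
      "}\n"
      : "=r"(leader_lane), "=r"(is_elected)
      : "r"(0xFFFFFFFFu));
  (void)leader_lane;       // lane id of the elected lane, unused here
  return is_elected != 0;  // true in exactly one lane of the first warp
}
```
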

@rdspring1 rdspring1 changed the title from "TMA Circular Buffering" to "[Tracking] TMA Circular Buffering" on Aug 22, 2024
rdspring1 added a commit that referenced this pull request Aug 27, 2024
This PR refactors `CircularBufferLoopCloner` to avoid clang-tidy issues
in #2773.

- Track cloned for loop instead of its Scope
- Add virtual methods `processExpr` and `processForLoop` for
`TmaCircularBufferLoopCloner` to override.

Details:
```
  Error (CLANGTIDY) [bugprone-parent-virtual-call,-warnings-as-errors]
    qualified name 'kir::IrVisitor::dispatch' refers to a member overridden in
    subclass; did you mean 'nvfuser::CircularBufferLoopCloner'?
```
rdspring1 added a commit that referenced this pull request Sep 4, 2024
## Summary ##
This PR contains the changes to the allocation lowering pass from
#2773.

## Details ##
### GpuLower ###
- `ldst_mbarrier_token_map_` maps `LoadStoreOp` to mbarrier tokens,
which are represented as a `TensorView` sized by the number of pipeline stages.
- `mbarrier_token_smem_alloc_set_` tracks the `kir::Allocate`
expressions for the mbarriers and their tokens.
- `ldst_mbarrier_index_map_` maps the cloned `LoadStoreOp` in the
prologue and main loops to their indexed mbarrier.

### Allocation ###
- In the allocation pass, create shared memory allocations and
operations around `LoadStoreOp` expression.

```cpp
// Created tokens, mbarriers, init, and inval operations in allocation pass.
for (circular_buffer_loop) {
    __shared__ int64_t tokens[num_stages];
    __shared__ int64_t mbarrier[num_stages];
    init(mbarrier);
    cp.async.bulk(data, mbarrier);
    inval(mbarrier);
}
```

### AliasMemory ###
- The mbarrier and its token are mapped together. The token is the
mbarrier state of the last phase. For simplicity, mark token liveness
when the mbarrier is initialized and invalidated (a toy sketch follows below).
- Apply `markWrite` for mbarrier and its token when the expression is
`MBarrierInit`
- Apply `markRead` for mbarrier and its token when the expression is
`MBarrierInvalidate`
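
A self-contained toy sketch of that liveness rule (hypothetical names and data structures, not the actual AliasMemory pass): the token shares the mbarrier's lifetime, marked written at MBarrierInit and read at MBarrierInvalidate.

```cpp
// Toy sketch only: track a write/read interval per buffer name.
#include <cstdio>
#include <map>
#include <string>

struct Interval {
  int first_write = -1;
  int last_read = -1;
};

int main() {
  std::map<std::string, Interval> liveness;
  auto markWrite = [&](const std::string& buffer, int pos) {
    if (liveness[buffer].first_write < 0) liveness[buffer].first_write = pos;
  };
  auto markRead = [&](const std::string& buffer, int pos) {
    liveness[buffer].last_read = pos;
  };

  // Illustrative positions of the lowered expressions in the kernel.
  const int init_pos = 3;    // MBarrierInit
  const int inval_pos = 42;  // MBarrierInvalidate
  markWrite("mbarrier", init_pos);
  markWrite("token", init_pos);
  markRead("mbarrier", inval_pos);
  markRead("token", inval_pos);

  for (const auto& [name, interval] : liveness) {
    std::printf("%s is live over [%d, %d]\n",
                name.c_str(), interval.first_write, interval.last_read);
  }
  return 0;
}
```
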
rdspring1 added a commit that referenced this pull request Sep 5, 2024
## Summary ##
This PR contains the changes to the indexing lowering pass from
#2773. It is stacked on #2824.
Tracking branch: #2773

## Details ##
- In the circular buffering pass, we manually index the mbarriers and
tokens using the index of the prologue, main, and epilogue loops.
```cpp
for (int index : c10::irange(fl->extent())) {
  int stage = index % number_of_pipeline_stages;
  mbarrier_t current_stage_mbarrier = mbarriers[stage];  // represented with kir::TensorIndex

  int next_stage = (index + number_of_stages - 1) % number_of_pipeline_stages;
  mbarrier_t next_stage_mbarrier = mbarriers[next_stage];  // represented with kir::TensorIndex
}
```
- The handle functions for `kir::MBarrierInit`,
`kir::MBarrierInvalidate`, `kir::MBarrierArriveExpectTx`, and
`kir::MBarrierWait` are modified to handle `kir::TensorIndex`.
- `u32IndexScalarSmemTv` is modified to get the shared memory pointer
address for a `kir::TensorIndex`.
* Add support for Hopper::electSync
* Create ElectSync PredicateType
* Make the mbarrier synchronous (see the sketch below)
  * The mbarrier waits for all threads in the CTA
  * All threads issue arriveExpectTx to get the mbarrier_token
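
The synchronous-mbarrier behavior in the last two sub-items can be illustrated with libcu++'s cuda::barrier, which maps to a hardware mbarrier on Hopper. This is an assumed standalone illustration, not the code nvFuser generates:

```cpp
// Assumed illustration only: every thread in the CTA arrives on the barrier
// and then waits on the returned token. Assumes a 1-D thread block.
#include <cuda/barrier>

__global__ void synchronous_mbarrier_demo() {
  __shared__ cuda::barrier<cuda::thread_scope_block> bar;
  if (threadIdx.x == 0) {
    init(&bar, blockDim.x);  // expected arrival count: every thread in the CTA
  }
  __syncthreads();

  // ... a cp.async.bulk / TMA load into shared memory would be issued here ...

  auto token = bar.arrive();   // each thread arrives once
  bar.wait(std::move(token));  // block until the current phase completes
}
```
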
@rdspring1 (Collaborator Author)

!build

@rdspring1 (Collaborator Author)

!build

@rdspring1 rdspring1 force-pushed the tma_circular_buffering branch from 7f9e5f3 to 5e38511 on September 8, 2024 23:11
rdspring1 added a commit that referenced this pull request Sep 29, 2024
## Summary ##
This PR adds support for TMA circular buffering. It is stacked on #2824
and #2825.
Tracking branch: #2773

## Description ##
- The Pre-Prologue and Post-Epilogue loops are created in the allocation
pass.
- The Pre-Prologue loop allocates shared memory and initializes mbarriers,
while the Post-Epilogue loop invalidates mbarriers.
- In the circular buffer pass, `CloneTmaCircularBufferLoopAndInsertSync`
clones operations and inserts mbarrier synchronization logic to create
the prologue, main, and epilogue for-loops.
- Prologue copies only the load operations. `arriveExpectTx` and
`arrive` expressions are created for cpAsyncBulk load operations.
- Main loop copies the load and computation operations, adds
`arriveExpectTx` and `arrive` for the next stage, and calls `mbarrierWait`
for the current stage.
- Epilogue copies only the computation operations and adds
`mbarrierWait` for the remaining stages in the pipeline.


## Lowering Details ##
Description of changes in lowering passes.
- `Prologue`, `Main`, and `Epilogue` loops are created by
`CloneTmaCircularBufferLoopAndInsertSync` which is a child class of
`CircularBufferLoopCloner`.
- `PrePrologue` and `PostEpilogue` loops are created in the allocation
pass.
- The `cuTensorMapEncodeTiled` API restricts the size of each box dimension
to be `<= 256`, so multiple load operations must be launched to load
larger tiles.
- We allocate `mbarriers` only per stage, not per load, so the
`expected_transaction` byte count is multiplied by the number of TMA loads
per stage.
- The for-loop cloner must account for the nested for-loop structure
used to launch multiple TMA loads before adding the `mbarrier_wait` for
the stage.


## Loop Structure ##
Description of for-loop structure for circular buffering.

<details>
<summary>Overview Circular Buffer Structure:</summary>

### Pre-prologue loop: ###
- Allocate shared memory for mbarriers and mbarrier tokens
- Initialize mbarrier for all stages

### Prologue loop: ###
- if selected_thread:
  - Issue cp async bulks for all but last stage

### Main loop: ###
- if selected_thread:
  - Issue next cp async bulk for available stage
- All threads wait until tma operation arrives
- Copy body without
  - shared memory allocations
  - mbarrier_init exprs
  - mbarrier_inval exprs

### Epilogue loop: ###
- All threads wait until tma operation arrives
- Copy body without
  - shared memory allocations
  - issuing cp async bulk operations
  - mbarrier_init exprs
  - mbarrier_inval exprs

### Post-epilogue loop: ###
- if selected_thread:
  - Invalidate mbarrier for all stages
</details>

<details>
<summary>Detailed Pseudo-Code:</summary>

```cpp
constexpr int64_t warp_size = 32;
bool first_warp = threadIdx.x < warp_size && threadIdx.y == 0 && threadIdx.z == 0;
```

### Pre-Prologue loop: ###
```cpp
__shared__ __mbarrier_t barriers[num_stages];
__shared__ __mbarrier_token_t tokens[num_stages];
for (int64_t loop_index : irange(stages)) {
  if (first_warp && hopper::electSync()) {
    mbarrier_init(mbarrier[loop_index], number_of_arrival_threads);
  }
}
```

### Prologue loop: ###
```cpp
// Launch loads for the first (stages-1) stages
for (int64_t loop_index : irange(stages-1)) {
  if (first_warp && hopper::electSync()) {
    tokens[loop_index] = mbarrier::arriveExpectTx(mbarrier[loop_index]);
    cpAsyncBulk(mbarriers[loop_index], ...);
  } else {
    tokens[loop_index] = mbarrier::arrive(mbarrier[loop_index]);
  }
}
```

### Main loop: ###
```cpp
// Launch load for last available stage. Wait for the current stage in pipeline.
// Repeat for extent - (stages-1) iterations
for (int64_t loop_index : irange(N-(stages-1))) {
  current_stage = loop_index % stage_depth;
  load_stage = (loop_index + (stage_depth - 1)) % stage_depth;
  if (first_warp && hopper::electSync()) {
    tokens[load_stage] =
      mbarrier::arriveExpectTx(mbarrier[load_stage], expected_transaction_size);
    cpAsyncBulk(mbarrier[load_stage], ...);
  } else {
    tokens[load_stage] = mbarrier::arrive(mbarrier[load_stage]);
  }
  mbarrier::wait(tokens[current_stage]);

  // Clone remaining operations
}
```

### Epilogue loop: ###
```cpp
// Wait for current stage in pipeline. Repeat for remaining iterations in extent.
for (int64_t loop_index : irange(N-(stages-1), N)) {
  current_stage = loop_index % stage_depth;
  mbarrier::wait(tokens[current_stage]);

  // Clone remaining operations
}
```

### Post-Epilogue loop: ###
```cpp
for (int64_t loop_index : irange(stages)) {
  if (first_warp && hopper::electSync()) {
    mbarrier_inval(mbarrier[loop_index]);
  }
}
```
</details>

## Testing Setup ##
- 2 to 4 pipeline stages.
- (128, 500, 1024) outer dimension.
- (128, 1024) inner dimension.

1. Single Dim including Unroll and Unswitch parallelizations.
2. Multiple Dim
3. Pointwise 
- One Tensor is loaded with TMA circular buffering. The other tensor is
loaded with Set circular buffering.
4. PointwiseCpAsync 
- One Tensor is loaded with TMA circular buffering. The other tensor is
loaded with CpAsync circular buffering. This test is currently disabled,
but will be fixed by #2339.
5. Reduction
6. InnerPersistent 
- In this schedule, the output TensorView of the cpAsyncBulk load has a
serial iterDomain to the right of the computeAt position. A for-loop will
launch multiple TMA loads for each pipeline stage.
7. Matmul
@rdspring1 rdspring1 closed this Oct 1, 2024
@rdspring1 rdspring1 deleted the tma_circular_buffering branch October 29, 2024 21:52