Add TMA support for circular buffering pass #2833

Merged: 40 commits into main from tma_cb on Sep 29, 2024

Conversation

@rdspring1 (Collaborator) commented on Aug 22, 2024

Summary

This PR adds support for TMA circular buffering. It is stacked on #2824 and #2825.
Tracking branch: #2773

Description

  • The Pre-Prologue and Post-Epilogue loops are created in the allocation pass.
  • The Pre-Prologue loop allocates shared memory and initializes mbarriers, while the Post-Epilogue loop invalidates mbarriers.
  • In the circular buffer pass, CloneTmaCircularBufferLoopAndInsertSync clones operations and inserts mbarrier synchronization logic to create the prologue, main, and epilogue for-loops.
  • The Prologue loop copies only the load operations; arriveExpectTx and arrive expressions are created for cpAsyncBulk load operations.
  • The Main loop copies the load and computation operations, adds arriveExpectTx and arrive for the next stage, and calls mbarrierWait for the current stage.
  • The Epilogue loop copies only the computation operations and adds mbarrierWait for the remaining stages in the pipeline.

Lowering Details

Description of changes in lowering passes.

  • The Prologue, Main, and Epilogue loops are created by CloneTmaCircularBufferLoopAndInsertSync, which is a child class of CircularBufferLoopCloner.
  • The Pre-Prologue and Post-Epilogue loops are created in the allocation pass.
  • cuTensorMapEncodeTiled restricts each box dimension to at most 256 elements, so multiple load operations must be launched to load larger tiles.
  • We allocate only one mbarrier per stage, so the expected_transaction byte count is multiplied by the number of TMA loads per stage.
  • The for-loop cloner must account for the nested for-loop structure used to launch multiple TMA loads before adding the mbarrier_wait for the stage (see the sketch below).
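To make the last three points concrete, here is a minimal sketch in the same pseudo-code style used in this description. The tile and box sizes are hypothetical, chosen only to illustrate how the number of loads per stage scales the expected transaction size:

// Hypothetical sizes: a 128x512 float tile with the inner box capped at 256
// elements needs two TMA loads per stage.
constexpr int64_t tile_rows = 128, tile_cols = 512, box_cols = 256;
constexpr int64_t loads_per_stage = tile_cols / box_cols;  // 2

// The single mbarrier for this stage expects the bytes from *all* loads:
// 2 * 128 * 256 * 4 = 262144 bytes for a float tile.
constexpr int64_t expected_transaction_size =
    loads_per_stage * tile_rows * box_cols * sizeof(float);

// One arriveExpectTx per stage, then a nested for-loop issuing one
// cpAsyncBulk per box; the cloner must step over this entire inner loop
// before inserting the mbarrier_wait for the stage.
tokens[load_stage] =
    mbarrier::arriveExpectTx(mbarrier[load_stage], expected_transaction_size);
for (int64_t box : irange(loads_per_stage)) {
  cpAsyncBulk(mbarrier[load_stage], /*box=*/box, ...);
}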

Loop Structure

Description of for-loop structure for circular buffering.

Overview of the circular buffer structure:

Pre-prologue loop:

  • Allocate shared memory for mbarriers and mbarrier tokens
  • Initialize mbarrier for all stages

Prologue loop:

  • if selected_thread:
    • Issue cpAsyncBulk loads for all but the last stage

Main loop:

  • if selected_thread:
    • Issue the next cpAsyncBulk load for the available stage
  • All threads wait until the TMA operation arrives
  • Copy body without
    • shared memory allocations
    • mbarrier_init exprs
    • mbarrier_inval exprs

Epilogue loop:

  • All threads wait until the TMA operation arrives
  • Copy body without
    • shared memory allocations
    • issuing cpAsyncBulk operations
    • mbarrier_init exprs
    • mbarrier_inval exprs

Post-epilogue loop:

  • if selected_thread:
    • Invalidate mbarriers for all stages

Detailed Pseudo-Code:
constexpr int64_t warp_size = 32;
bool first_warp = threadIdx.x < warp_size && threadIdx.y == 0 && threadIdx.z == 0;
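For context, hopper::electSync() in the pseudo-code below corresponds to the sm_90 PTX elect.sync instruction, which picks exactly one leader lane within a warp. A minimal sketch of such a predicate is shown here; the inline assembly and the full-warp member mask are illustrative assumptions, not necessarily the runtime's exact implementation:

__device__ inline bool electSync() {
  uint32_t is_leader = 0;
  // elect.sync elects one active lane; the `_` sink discards the elected
  // lane id, keeping only the predicate (assumes a full 0xffffffff mask).
  asm volatile(
      "{\n"
      "  .reg .pred P;\n"
      "  elect.sync _|P, 0xffffffff;\n"
      "  selp.u32 %0, 1, 0, P;\n"
      "}\n"
      : "=r"(is_leader));
  return is_leader != 0;
}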

Pre-Prologue loop:

__shared__ __mbarrier_t mbarrier[stages];
__shared__ __mbarrier_token_t tokens[stages];
for (int64_t loop_index : irange(stages)) {
  if (first_warp && hopper::electSync()) {
    mbarrier_init(mbarrier[loop_index], number_of_arrival_threads);
  }
}

Prologue loop:

// Launch loads for the first stages-1 stages
for (int64_t loop_index : irange(stages-1)) {
  if (first_warp && hopper::electSync()) {
    tokens[loop_index] =
        mbarrier::arriveExpectTx(mbarrier[loop_index], expected_transaction_size);
    cpAsyncBulk(mbarrier[loop_index], ...);
  } else {
    tokens[loop_index] = mbarrier::arrive(mbarrier[loop_index]);
  }
}

Main loop:

// Launch the load for the last available stage. Wait for the current stage in
// the pipeline. Repeat for extent - (stages-1) iterations.
for (int64_t loop_index : irange(N-(stages-1))) {
  current_stage = loop_index % stages;
  load_stage = (loop_index + (stages - 1)) % stages;
  if (first_warp && hopper::electSync()) {
    tokens[load_stage] =
        mbarrier::arriveExpectTx(mbarrier[load_stage], expected_transaction_size);
    cpAsyncBulk(mbarrier[load_stage], ...);
  } else {
    tokens[load_stage] = mbarrier::arrive(mbarrier[load_stage]);
  }
  mbarrier::wait(tokens[current_stage]);

  // Clone remaining operations
}
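As a concrete trace of the index arithmetic above (a worked example, not output from the PR): with stages = 3 and extent N = 8, the prologue loads stages 0 and 1, and the main loop runs N - (stages-1) = 6 iterations:

// loop_index | current_stage (wait) | load_stage (load)
// 0          | 0                    | 2
// 1          | 1                    | 0
// 2          | 2                    | 1
// 3          | 0                    | 2
// 4          | 1                    | 0
// 5          | 2                    | 1
// The epilogue then waits on stages 0 and 1 for loop_index 6 and 7.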

Epilogue loop:

// Wait for the current stage in the pipeline. Repeat for the remaining
// iterations in the extent.
for (int64_t loop_index : irange(N-(stages-1), N)) {
  current_stage = loop_index % stages;
  mbarrier::wait(tokens[current_stage]);

  // Clone remaining operations
}

Post-Epilogue loop:

for (int64_t loop_index : irange(stages)) {
  if (first_warp && hopper::electSync()) {
    mbarrier_inval(mbarrier[loop_index]);
  }
}

Testing Setup

  • 2 to 4 pipeline stages.
  • (128, 500, 1024) outer dimension.
  • (128, 1024) inner dimension.
  1. Single Dim, including Unroll and Unswitch parallelizations.
  2. Multiple Dim
  3. Pointwise
     • One tensor is loaded with TMA circular buffering. The other tensor is loaded with Set circular buffering.
  4. PointwiseCpAsync
     • One tensor is loaded with TMA circular buffering. The other tensor is loaded with CpAsync circular buffering. This test is currently disabled, but will be fixed by async copy save registers #2339.
  5. Reduction
  6. InnerPersistent
     • In this schedule, the output TensorView of the cpAsyncBulk load has a serial iterDomain to the right of the computeAt position. A for-loop will launch multiple TMA loads for each pipeline stage.
  7. Matmul
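For orientation, a schedule in these tests might look roughly like the following sketch. The API usage is hypothetical, inferred from the test names and description above; exact signatures may differ from the repository:

// Hypothetical sketch: stage tv0 through shared memory via a TMA load,
// then enable circular buffering on the staged copy.
TensorView* tv0_smem = tv0->cacheAfter(LoadStoreOpType::CpAsyncBulkTensorTile);
tv0_smem->setMemoryType(MemoryType::Shared);
// ... split/parallelize so each TMA box respects the <=256 restriction ...
tv0_smem->circularBuffer(number_of_stages);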

@rdspring1 rdspring1 changed the title Add TMA support for circular buffering pass and testing Add TMA support for circular buffering pass Aug 22, 2024
@csarofeen (Collaborator)

Awesome, detailed PR description. Thank you.

* Add support for Hopper::electSync
* Create ElectSync PredicateType
* Make mbarrier synchronous
  * mbarrier waits for all threads in CTA
  * All threads issue arriveExpectTx to get mbarrier_token
@jacobhinkle (Collaborator) left a comment:

Just some minor comments from a first pass. I haven't looked at tests yet.

@rdspring1 (Collaborator, Author)

Do we expect to mix tma with cp.async in a kernel?

The pointwise, reduction, and persistent fusions in my tests do not work with cp.async regardless of circular buffering usage. It seems unusual, but I don't see any NVF_ERRORs, just incorrect results.

@zasdfgbnm (Collaborator)

Do we expect to mix tma with cp.async in a kernel?

No, we do not expect that. I just want to understand what the behavior will be (will it just work? If not, are we correctly throwing errors saying that this is not supported?).

* Create `TmaCircularBufferInfo` struct to consolidate data fields for TMA circular buffering.
* Move shared memory allocations outside of circular buffering loop
* Remove GatherMBarrierAllocations
@rdspring1 rdspring1 requested a review from zasdfgbnm September 26, 2024 02:18
@rdspring1 (Collaborator, Author)

!build

@rdspring1 rdspring1 requested a review from zasdfgbnm September 26, 2024 23:11
@zasdfgbnm (Collaborator) left a comment:

LGTM! Thanks for the great work and discussion!

Comment on lines +1094 to +1098
if (hasCpAsyncBulk) {
  insertTma(loop, it->second);
} else {
  insert(loop, it->second);
}

I tried TEST_P(TmaCircularBufferingTest, Pointwise) without any circular buffering, loading TV0 with TMA and TV1 with cp.async, and got incorrect results.

So TV0 is not circular buffered, and TV1 is?

Should we check that all circular buffer loads have the same LoadStoreOpType?
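One possible form of that check, as a hypothetical sketch (not code from this PR; it assumes the circular buffer loads are collected as LoadStoreOp nodes with an opType() accessor):

// Hypothetical helper: require a single LoadStoreOpType across all circular
// buffer loads, failing with an NVF_ERROR otherwise.
void validateCircularBufferLoadTypes(const std::vector<LoadStoreOp*>& loads) {
  if (loads.empty()) {
    return;
  }
  const LoadStoreOpType expected = loads.front()->opType();
  for (const LoadStoreOp* ldst : loads) {
    NVF_ERROR(
        ldst->opType() == expected,
        "Mixing load operation types within circular buffering is not supported.");
  }
}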

@rdspring1 (Collaborator, Author)

!build

@rdspring1 rdspring1 merged commit 2cee59d into main Sep 29, 2024
39 of 41 checks passed
@rdspring1 rdspring1 deleted the tma_cb branch September 29, 2024 19:53