Add TMA support for circular buffering pass #2833

Merged: 40 commits into main from tma_cb on Sep 29, 2024

Conversation

@rdspring1 (Collaborator) commented on Aug 22, 2024

Summary

This PR adds support for TMA circular buffering. It is stacked on #2824 and #2825.
Tracking branch: #2773

Description

  • The Pre-Prologue and Post-Epilogue loops are created in the allocation pass.
  • The Pre-Prologue loop allocates shared memory and initializes mbarriers, while the Post-Epilogue loop invalidates mbarriers.
  • In the circular buffer pass, CloneTmaCircularBufferLoopAndInsertSync clones operations and inserts mbarrier synchronization logic to create the prologue, main, and epilogue for-loops.
  • The Prologue loop copies only the load operations; arriveExpectTx and arrive expressions are created for cpAsyncBulk load operations.
  • The Main loop copies the load and computation operations, adds arriveExpectTx and arrive for the next stage, and calls mbarrierWait for the current stage.
  • The Epilogue loop copies only the computation operations and adds mbarrierWait for the remaining stages in the pipeline.

Lowering Details

Description of changes in lowering passes.

  • The Prologue, Main, and Epilogue loops are created by CloneTmaCircularBufferLoopAndInsertSync, which is a child class of CircularBufferLoopCloner.
  • The Pre-Prologue and Post-Epilogue loops are created in the allocation pass.
  • cuTensorMapEncodeTiled restricts each box dimension to at most 256 elements, so multiple load operations must be launched to load larger tiles.
  • We allocate only one mbarrier per stage, so the expected_transaction byte count is multiplied by the number of TMA loads per stage.
  • The for-loop cloner must account for the nested for-loop structure used to launch multiple TMA loads before adding the mbarrier_wait for the stage (see the sketch below).
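To make the last three points concrete, here is a minimal sketch in the same pseudo-code style used in this description. The tile and box sizes are hypothetical, chosen only to illustrate how the number of loads per stage scales the expected transaction size:

// Hypothetical sizes: a 128x512 float tile with the inner box capped at 256
// elements needs two TMA loads per stage.
constexpr int64_t tile_rows = 128, tile_cols = 512, box_cols = 256;
constexpr int64_t loads_per_stage = tile_cols / box_cols;  // 2

// The single mbarrier for this stage expects the bytes from *all* loads:
// 2 * 128 * 256 * 4 = 262144 bytes for a float tile.
constexpr int64_t expected_transaction_size =
    loads_per_stage * tile_rows * box_cols * sizeof(float);

// One arriveExpectTx per stage, then a nested for-loop issuing one
// cpAsyncBulk per box; the cloner must step over this entire inner loop
// before inserting the mbarrier_wait for the stage.
tokens[load_stage] =
    mbarrier::arriveExpectTx(mbarrier[load_stage], expected_transaction_size);
for (int64_t box : irange(loads_per_stage)) {
  cpAsyncBulk(mbarrier[load_stage], /*box=*/box, ...);
}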

Loop Structure

Description of for-loop structure for circular buffering.

Overview of the circular buffer structure:

Pre-prologue loop:

  • Allocate shared memory for mbarriers and mbarrier tokens
  • Initialize mbarrier for all stages

Prologue loop:

  • if selected_thread:
    • Issue cpAsyncBulk loads for all but the last stage

Main loop:

  • if selected_thread:
    • Issue the next cpAsyncBulk load for the available stage
  • All threads wait until the TMA operation arrives
  • Copy body without
    • shared memory allocations
    • mbarrier_init exprs
    • mbarrier_inval exprs

Epilogue loop:

  • All threads wait until the TMA operation arrives
  • Copy body without
    • shared memory allocations
    • issuing cpAsyncBulk operations
    • mbarrier_init exprs
    • mbarrier_inval exprs

Post-epilogue loop:

  • if selected_thread:
    • Invalidate mbarriers for all stages

Detailed Pseudo-Code:
constexpr int64_t warp_size = 32;
bool first_warp = threadIdx.x < warp_size && threadIdx.y == 0 && threadIdx.z == 0;
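For context, hopper::electSync() in the pseudo-code below corresponds to the sm_90 PTX elect.sync instruction, which picks exactly one leader lane within a warp. A minimal sketch of such a predicate is shown here; the inline assembly and the full-warp member mask are illustrative assumptions, not necessarily the runtime's exact implementation:

__device__ inline bool electSync() {
  uint32_t is_leader = 0;
  // elect.sync elects one active lane; the `_` sink discards the elected
  // lane id, keeping only the predicate (assumes a full 0xffffffff mask).
  asm volatile(
      "{\n"
      "  .reg .pred P;\n"
      "  elect.sync _|P, 0xffffffff;\n"
      "  selp.u32 %0, 1, 0, P;\n"
      "}\n"
      : "=r"(is_leader));
  return is_leader != 0;
}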

Pre-Prologue loop:

__shared__ __mbarrier_t mbarrier[stages];
__shared__ __mbarrier_token_t tokens[stages];
for (int64_t loop_index : irange(stages)) {
  if (first_warp && hopper::electSync()) {
    mbarrier_init(mbarrier[loop_index], number_of_arrival_threads);
  }
}

Prologue loop:

// Launch loads for the first stages-1 stages
for (int64_t loop_index : irange(stages-1)) {
  if (first_warp && hopper::electSync()) {
    tokens[loop_index] =
        mbarrier::arriveExpectTx(mbarrier[loop_index], expected_transaction_size);
    cpAsyncBulk(mbarrier[loop_index], ...);
  } else {
    tokens[loop_index] = mbarrier::arrive(mbarrier[loop_index]);
  }
}

Main loop:

// Launch the load for the last available stage. Wait for the current stage in
// the pipeline. Repeat for extent - (stages-1) iterations.
for (int64_t loop_index : irange(N-(stages-1))) {
  current_stage = loop_index % stages;
  load_stage = (loop_index + (stages - 1)) % stages;
  if (first_warp && hopper::electSync()) {
    tokens[load_stage] =
        mbarrier::arriveExpectTx(mbarrier[load_stage], expected_transaction_size);
    cpAsyncBulk(mbarrier[load_stage], ...);
  } else {
    tokens[load_stage] = mbarrier::arrive(mbarrier[load_stage]);
  }
  mbarrier::wait(tokens[current_stage]);

  // Clone remaining operations
}
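As a concrete trace of the index arithmetic above (a worked example, not output from the PR): with stages = 3 and extent N = 8, the prologue loads stages 0 and 1, and the main loop runs N - (stages-1) = 6 iterations:

// loop_index | current_stage (wait) | load_stage (load)
// 0          | 0                    | 2
// 1          | 1                    | 0
// 2          | 2                    | 1
// 3          | 0                    | 2
// 4          | 1                    | 0
// 5          | 2                    | 1
// The epilogue then waits on stages 0 and 1 for loop_index 6 and 7.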

Epilogue loop:

// Wait for the current stage in the pipeline. Repeat for the remaining
// iterations in the extent.
for (int64_t loop_index : irange(N-(stages-1), N)) {
  current_stage = loop_index % stages;
  mbarrier::wait(tokens[current_stage]);

  // Clone remaining operations
}

Post-Epilogue loop:

for (int64_t loop_index : irange(stages)) {
  if (first_warp && hopper::electSync()) {
    mbarrier_inval(mbarrier[loop_index]);
  }
}

Testing Setup

  • 2 to 4 pipeline stages.
  • (128, 500, 1024) outer dimension.
  • (128, 1024) inner dimension.
  1. Single Dim, including Unroll and Unswitch parallelizations.
  2. Multiple Dim
  3. Pointwise
     • One tensor is loaded with TMA circular buffering. The other tensor is loaded with Set circular buffering.
  4. PointwiseCpAsync
     • One tensor is loaded with TMA circular buffering. The other tensor is loaded with CpAsync circular buffering. This test is currently disabled, but will be fixed by async copy save registers #2339.
  5. Reduction
  6. InnerPersistent
     • In this schedule, the output TensorView of the cpAsyncBulk load has a serial iterDomain to the right of the computeAt position. A for-loop will launch multiple TMA loads for each pipeline stage.
  7. Matmul
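For orientation, a schedule in these tests might look roughly like the following sketch. The API usage is hypothetical, inferred from the test names and description above; exact signatures may differ from the repository:

// Hypothetical sketch: stage tv0 through shared memory via a TMA load,
// then enable circular buffering on the staged copy.
TensorView* tv0_smem = tv0->cacheAfter(LoadStoreOpType::CpAsyncBulkTensorTile);
tv0_smem->setMemoryType(MemoryType::Shared);
// ... split/parallelize so each TMA box respects the <=256 restriction ...
tv0_smem->circularBuffer(number_of_stages);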

@rdspring1 rdspring1 changed the title Add TMA support for circular buffering pass and testing Add TMA support for circular buffering pass Aug 22, 2024
@csarofeen (Collaborator)

Awesome, detailed PR description. Thank you.

* Add support for Hopper::electSync
* Create ElectSync PredicateType
* Make mbarrier synchronous
  * mbarrier waits for all threads in CTA
  * All threads issue arriveExpectTx to get mbarrier_token
@jacobhinkle (Collaborator) left a comment:

Just some minor comments from a first pass. I haven't looked at tests yet.

@rdspring1 (Collaborator, Author)

Do we expect to mix tma with cp.async in a kernel?

The pointwise, reduction, and persistent fusions in my tests do not work with cp.async regardless of circular buffering usage. It seems unusual, but I don't see any NVF_ERRORs, just incorrect results.

@zasdfgbnm (Collaborator)

Do we expect to mix tma with cp.async in a kernel?

No, we do not expect that. I just want to understand what the behavior will be (will it just work? If not, are we correctly throwing errors saying that this is not supported?).

* Create `TmaCircularBufferInfo` struct to consolidate data fields for TMA circular buffering.
* Move shared memory allocations outside of circular buffering loop
* Remove GatherMBarrierAllocations
@rdspring1 rdspring1 requested a review from zasdfgbnm September 26, 2024 02:18
@rdspring1 (Collaborator, Author)

!build

@rdspring1 rdspring1 requested a review from zasdfgbnm September 26, 2024 23:11
@zasdfgbnm (Collaborator) left a comment:

LGTM! Thanks for the great work and discussion!

Comment on lines +1094 to +1098
if (hasCpAsyncBulk) {
  insertTma(loop, it->second);
} else {
  insert(loop, it->second);
}

I tried TEST_P(TmaCircularBufferingTest, Pointwise) without any circular buffering, loading TV0 with TMA and TV1 with cp.async, and got incorrect results.

So TV0 is not circular buffered, and TV1 is?

Should we check that all circular buffer loads have the same LoadStoreOpType?
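One possible form of that check, as a hypothetical sketch (not code from this PR; it assumes the circular buffer loads are collected as LoadStoreOp nodes with an opType() accessor):

// Hypothetical helper: require a single LoadStoreOpType across all circular
// buffer loads, failing with an NVF_ERROR otherwise.
void validateCircularBufferLoadTypes(const std::vector<LoadStoreOp*>& loads) {
  if (loads.empty()) {
    return;
  }
  const LoadStoreOpType expected = loads.front()->opType();
  for (const LoadStoreOp* ldst : loads) {
    NVF_ERROR(
        ldst->opType() == expected,
        "Mixing load operation types within circular buffering is not supported.");
  }
}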

@rdspring1 (Collaborator, Author)

!build

@rdspring1 rdspring1 merged commit 2cee59d into main Sep 29, 2024
39 of 41 checks passed
@rdspring1 rdspring1 deleted the tma_cb branch September 29, 2024 19:53