Allocation changes for TMA Circular Buffering #2824

Merged: 3 commits merged into main on Sep 4, 2024

Conversation

@rdspring1 (Collaborator) commented Aug 21, 2024

Summary

This PR contains the changes to the allocation lowering pass from #2773.

Details

GpuLower

  • ldst_mbarrier_token_map_ maps each LoadStoreOp to its mbarrier tokens, which are represented as a TensorView sized to the number of pipeline stages.
  • mbarrier_token_smem_alloc_set_ tracks the kir::Allocate expressions for the mbarriers and their tokens.
  • ldst_mbarrier_index_map_ maps the cloned LoadStoreOp in the prologue and main loops to their indexed mbarrier.

Allocation

  • In the allocation pass, create the shared memory allocations and mbarrier operations around each LoadStoreOp expression, as shown below.
// Created tokens, mbarriers, init, and inval operations in allocation pass.
for (circular_buffer_loop) {
    __shared__ int64_t tokens[num_stages];
    __shared__ int64_t mbarrier[num_stages];
    init(mbarrier);
    cp.async.bulk(data, mbarrier);
    inval(mbarrier);
}
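For reference, a minimal sketch of how the pass might build these nodes. This is a sketch only: the TensorViewBuilder/IrBuilder calls mirror the snippets quoted later in this thread, and the exact shape, dtype, and constructor arguments are assumptions.

```cpp
// Sketch: one mbarrier (and one token) per pipeline stage, placed in shared
// memory, bracketed by init/inval expressions around the circular-buffer loop.
TensorView* mbarrier = TensorViewBuilder()
                           .shape(std::vector<int64_t>{num_stages})
                           .dtype(DataType::UInt64)
                           .build();
mbarrier->setMemoryType(MemoryType::Shared);

// Initialize before the circular-buffer loop, invalidate after it.
// `one_thread` is a placeholder for the arrival-thread count discussed below.
auto* mbarrier_init = IrBuilder::create<kir::MBarrierInit>(
    mbarrier, /*number_of_arrival_threads=*/one_thread);
auto* mbarrier_inval = IrBuilder::create<kir::MBarrierInvalidate>(mbarrier);
```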

AliasMemory

  • The mbarrier and its token are mapped together. The token is the mbarrier state of the last phase. For simplicity, mark the token's liveness when the mbarrier is initialized and invalidated.
  • Apply markWrite for the mbarrier and its token when the expression is MBarrierInit.
  • Apply markRead for the mbarrier and its token when the expression is MBarrierInvalidate (see the sketch below).
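A rough sketch of what that marking could look like inside the pass. The handler dispatch and accessor names here are assumptions; markWrite, markRead, and the token mapping come from this PR.

```cpp
// Sketch: treat the token as live over the same interval as its mbarrier.
void handle(kir::MBarrierInit* init) {
  // Liveness of the mbarrier/token pair starts at initialization.
  markWrite(init->mbarrier());
  markWrite(tokenFor(init->mbarrier()));  // tokenFor(): hypothetical token lookup
}

void handle(kir::MBarrierInvalidate* inval) {
  // Liveness of the mbarrier/token pair ends at invalidation.
  markRead(inval->mbarrier());
  markRead(tokenFor(inval->mbarrier()));  // tokenFor(): hypothetical token lookup
}
```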

Comment on lines +388 to +399
// Keep track of placeholders for tokens returned by arrive/expected tx
// mbarrier operations for each load/store operation that requires such
// synchronization
std::unordered_map<const Expr*, TensorView*> ldst_mbarrier_token_map_;

// Collection of kir::Allocate for smem buffers used for mbarrier and token
// objects from cpAsyncBulk synchronization
std::unordered_set<const Expr*> mbarrier_token_smem_alloc_set_;

// Keep track what mbarrier object is used in load/store operation that
// requires such synchronization, required by indexing pass
std::unordered_map<const Expr*, kir::TensorIndex*> ldst_mbarrier_index_map_;
Collaborator:

Should we put them together as a struct MBarrierInfo?
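A minimal sketch of what such a grouping could look like, purely for illustration; the field choices are assumptions based on the existing members above.

```cpp
struct MBarrierInfo {
  // Placeholder token returned by arrive / arriveExpectTx for this load/store.
  TensorView* token = nullptr;
  // Indexed mbarrier used by the cloned load/store in the prologue/main loops.
  kir::TensorIndex* indexed_mbarrier = nullptr;
};
// e.g. std::unordered_map<const Expr*, MBarrierInfo> ldst_mbarrier_info_map_;
```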

Collaborator (Author):

I'll make this change in a separate PR after #2824 and #2825 but before #2833.

@zasdfgbnm (Collaborator):

> • In the allocation pass, create the shared memory allocations and mbarrier operations around each LoadStoreOp expression.
>
> // Created tokens, mbarriers, init, and inval operations in allocation pass.
> for (circular_buffer_loop) {
>     __shared__ tokens[num_stages];
>     __shared__ mbarrier[num_stages];
>     init(mbarrier);
>     cp.async.bulk(data, mbarrier);
>     inval(mbarrier);
> }

Could you add this as a comment in the code?

Comment on csrc/device_lower/pass/allocation.cpp (outdated, resolved)
.build();
mbarrier->setMemoryType(MemoryType::Shared);

// The wait condition for mbarrier is a single thread and the expected
Collaborator:

Instead of assuming there will be only a single thread issuing the TMA load, should we check the parallelization of expr->output(0) to determine that value?

Collaborator (Author):

The number of threads needs to be a constant when we call arrive_expected_tx in the CUDA kernel. The number of threads can be a runtime value, so defaulting to a single thread is the simplest option.

Collaborator (Author):

I'm using a regular block_sync for synchronizing the CTA rather than the mbarrier because of this compile-time constraint.

Collaborator:

I still don't understand. Could you elaborate on this:

> The number of threads needs to be a constant when we call arrive_expected_tx in the CUDA kernel.

Collaborator (Author):

> The number of threads needs to be a constant when we call arrive_expected_tx in the CUDA kernel.

I was wrong about this ^^^.

I think your proposal is this:

kir::MBarrierInit* mbarrier_init = IrBuilder::create<kir::MBarrierInit>(
    mbarrier,
    simplifyExpr(SimplifyingIrBuilder::maybeCastExpr(
        DataType::UInt32,
        lower_utils::getNumThreadsInTensorView(
            expr->output(0)->as<TensorView>()))));

I'm fine with this change.

@naoyam (Collaborator) left a comment:

LGTM. I defer to @zasdfgbnm.

@rdspring1 (Collaborator, Author):

!build

@zasdfgbnm (Collaborator) left a comment:

Stamping. I don't think there is anything in this PR that cannot be resolved in a separate PR.

@rdspring1 (Collaborator, Author):

!build

@rdspring1 rdspring1 merged commit bcdd181 into main Sep 4, 2024
36 checks passed
@rdspring1 rdspring1 deleted the tma_cb_alloc branch September 4, 2024 16:42
rdspring1 added a commit that referenced this pull request Sep 5, 2024
## Summary ##
This PR contains the changes to the indexing lowering pass from
#2773. It is stacked on #2824.
Tracking Branch: #2773

## Details ##
- In the circular buffering pass, we manually index the mbarriers and
tokens using the index of the prologue, main, and epilogue loops.
```cpp
for (int index : c10::irange(fl->extent())) {
  int stage = index % number_of_pipeline_stages;
  mbarrier_t current_stage_mbarrier = mbarriers[stage];  // represented with kir::TensorIndex

  int next_stage = (index + number_of_pipeline_stages - 1) % number_of_pipeline_stages;
  mbarrier_t next_stage_mbarrier = mbarriers[next_stage];  // represented with kir::TensorIndex
}
```
- The handle functions for `kir::MBarrierInit`,
`kir::MBarrierInvalidate`, `kir::MBarrierArriveExpectTx`, and
`kir::MBarrierWait` are modified to handle `kir::TensorIndex`.
- `u32IndexScalarSmemTv` is modified to get the shared memory pointer
address for a `kir::TensorIndex`.
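For context, a hedged sketch of the lowered form this enables: mbarrier PTX instructions take a 32-bit shared-memory address, so the generic pointer of the indexed mbarrier element is converted before use. The helper names below are illustrative, not nvFuser's actual runtime functions.

```cpp
#include <cstdint>

// Convert a generic shared-memory pointer to the 32-bit address expected by
// mbarrier PTX instructions.
__device__ inline uint32_t toSmemU32(const void* smem_ptr) {
  return static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
}

// Initialize the mbarrier for a given pipeline stage (mbarriers[stage] plays
// the role of the kir::TensorIndex produced by the indexing pass).
__device__ inline void initStageMbarrier(
    uint64_t* mbarriers, int64_t stage, uint32_t thread_count) {
  uint32_t addr = toSmemU32(&mbarriers[stage]);
  asm volatile("mbarrier.init.shared.b64 [%0], %1;" ::"r"(addr), "r"(thread_count));
}
```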
rdspring1 added a commit that referenced this pull request Sep 29, 2024
## Summary ##
This PR adds support for TMA circular buffering. It is stacked on #2824
and #2825.
Tracking branch: #2773

## Description ##
- The Pre-Prologue and Post-Epilogue loops are created in the allocation
pass.
- The Pre-Prologue loop allocates shared memory and initializes mbarriers,
while the Post-Epilogue loop invalidates them.
- In the circular buffer pass, `CloneTmaCircularBufferLoopAndInsertSync`
clones operations and inserts mbarrier synchronization logic to create
the prologue, main, and epilogue for-loops.
- The prologue copies only the load operations. `arriveExpectTx` and
`arrive` expressions are created for cpAsyncBulk load operations.
- The main loop copies the load and computation operations, adds
`arriveExpectTx` and `arrive` for the next stage, and calls `mbarrierWait`
for the current stage.
- The epilogue copies only the computation operations and adds
`mbarrierWait` for the remaining stages in the pipeline.


## Lowering Details ##
Description of changes in lowering passes.
- `Prologue`, `Main`, and `Epilogue` loops are created by
`CloneTmaCircularBufferLoopAndInsertSync` which is a child class of
`CircularBufferLoopCloner`.
- `PrePrologue` and `PostEpilogue` loops are created in the allocation
pass.
- The `cuTensorMapEncodeTiled` API restricts the size of each box dimension
to be `<= 256`, so multiple load operations are needed to load larger
tiles.
- We allocate one `mbarrier` per stage rather than per TMA load, so the
`expected_transaction` byte count is multiplied by the number of TMA loads
per stage (see the sketch after this list).
- The for-loop cloner must account for the nested for-loop structure
used to launch multiple TMA loads before adding the `mbarrier_wait` for
the stage.
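A minimal sketch of that expected-transaction computation; the variable and function names are illustrative, not nvFuser's.

```cpp
#include <cstdint>

// With one mbarrier per stage, the byte count passed to arriveExpectTx must
// cover every TMA load issued for that stage.
int64_t expectedTransactionBytes(
    int64_t bytes_per_tma_load,    // bytes moved by a single cp.async.bulk
    int64_t tma_loads_per_stage) { // > 1 when the tile exceeds the 256-element box limit
  return bytes_per_tma_load * tma_loads_per_stage;
}
```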


## Loop Structure ##
Description of for-loop structure for circular buffering.

<details>
<summary>Overview of the Circular Buffer Structure:</summary>

### Pre-prologue loop: ###
- Allocate shared memory for mbarriers and mbarrier tokens
- Initialize mbarrier for all stages

### Prologue loop: ###
- if selected_thread:
  - Issue cp async bulks for all but last stage

### Main loop: ###
- if selected_thread:
  - Issue next cp async bulk for available stage
- All threads wait until tma operation arrives
- Copy body without
  - shared memory allocations
  - mbarrier_init exprs
  - mbarrier_inval exprs

### Epilogue loop: ###
- All threads wait until tma operation arrives
- Copy body without
  - shared memory allocations
  - issuing cp async bulk operations
  - mbarrier_init exprs
  - mbarrier_inval exprs

### Post-epilogue loop: ###
- if selected_thread:
  - Invalidate mbarrier for all stages
</details>

<details>
<summary>Detailed Pseudo-Code:</summary>

```cpp
constexpr int64_t warp_size = 32;
bool first_warp = threadIdx.x < warp_size && threadIdx.y == 0 && threadIdx.z == 0;
```

### Pre-Prologue loop: ###
```cpp
__shared__ __mbarrier_t mbarrier[stages];
__shared__ __mbarrier_token_t tokens[stages];
for (int64_t loop_index : irange(stages)) {
  if (first_warp && hopper::electSync()) {
    mbarrier_init(mbarrier[loop_index], number_of_arrival_threads);
  }
}
```

### Prologue loop: ###
```cpp
// Launch loads for the first stages-1 stages
for (int64_t loop_index : irange(stages-1)) {
  if (first_warp && hopper::electSync()) {
    tokens[loop_index] =
      mbarrier::arriveExpectTx(mbarrier[loop_index], expected_transaction_size);
    cpAsyncBulk(mbarrier[loop_index], ...);
  } else {
    tokens[loop_index] = mbarrier::arrive(mbarrier[loop_index]);
  }
}
```

### Main loop: ###
```cpp
// Launch load for last available stage. Wait for the current stage in pipeline.
// Repeat for extent - (stages-1) iterations
for (int64_t loop_index : irange(N-(stages-1))) {
  int64_t current_stage = loop_index % stages;
  int64_t load_stage = (loop_index + (stages - 1)) % stages;
  if (first_warp && hopper::electSync()) {
    tokens[load_stage] =
      mbarrier::arriveExpectTx(mbarrier[load_stage], expected_transaction_size);
    cpAsyncBulk(mbarrier[load_stage], ...);
  } else {
    tokens[load_stage] = mbarrier::arrive(mbarrier[load_stage]);
  }
  mbarrier::wait(tokens[current_stage]);

  // Clone remaining operations
}
```

### Epilogue loop: ###
```cpp
// Wait for current stage in pipeline. Repeat for remaining iterations in extent.
for (int64_t loop_index : irange(N-(stages-1), N)) {
  int64_t current_stage = loop_index % stages;
  mbarrier::wait(tokens[current_stage]);

  // Clone remaining operations
}
```

### Post-Epilogue loop: ###
```cpp
for (int64_t loop_index : irange(stages)) {
  if (first_warp && hopper::electSync()) {
    mbarrier_inval(mbarrier[loop_index]);
  }
}
```
</details>

## Testing Setup ##
- 2 to 4 pipeline stages.
- Outer dimension sizes: 128, 500, and 1024.
- Inner dimension sizes: 128 and 1024.

1. Single Dim including Unroll and Unswitch parallelizations.
2. Multiple Dim
3. Pointwise 
- One tensor is loaded with TMA circular buffering. The other tensor is
loaded with Set circular buffering.
4. PointwiseCpAsync 
- One tensor is loaded with TMA circular buffering. The other tensor is
loaded with CpAsync circular buffering. This test is currently disabled,
but will be fixed by #2339.
5. Reduction
6. InnerPersistent 
- In this schedule, the output TensorView of the cpAsyncBulk load has a
serial iterDomain to the right of the computeAt position. A for-loop will
launch multiple TMA loads for each pipeline stage.
7. Matmul