
Missing sync in matmul kernel #3561

Open
zasdfgbnm opened this issue Dec 10, 2024 · 4 comments · May be fixed by #3563 or #3573

@zasdfgbnm
Collaborator

I just noticed that we are missing the wgmma.commit_group and wgmma.wait_group before the WAR (write-after-read) mbarrier arrive in the compute warp.

I have no idea why we are not seeing any wrong results.

Manually adding the sync back does not seem to hurt perf, as experimented in #3560.
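
For reference, a minimal sketch (not nvFuser's actual codegen) of the pieces involved, assuming an mbarrier-based WAR handshake between the load and compute warps; the helper names (wgmma_commit_group, wgmma_wait_group, war_arrive) are illustrative:

```cuda
#include <cstdint>

// Commit all prior outstanding wgmma.mma_async operations into a wgmma-group.
__device__ inline void wgmma_commit_group() {
  asm volatile("wgmma.commit_group.sync.aligned;" ::: "memory");
}

// Block until at most kPending committed wgmma-groups are still in flight.
template <int kPending>
__device__ inline void wgmma_wait_group() {
  asm volatile("wgmma.wait_group.sync.aligned %0;" ::"n"(kPending) : "memory");
}

// WAR arrive: signal the producer (load) warps that this consumer warpgroup
// is done reading the shared-memory operand buffers, so they may be reused.
// The mbarrier-based handshake here is an assumption for illustration.
__device__ inline void war_arrive(uint64_t* smem_mbarrier) {
  uint32_t addr =
      static_cast<uint32_t>(__cvta_generic_to_shared(smem_mbarrier));
  asm volatile(
      "{\n"
      ".reg .b64 state;\n"
      "mbarrier.arrive.shared::cta.b64 state, [%0];\n"
      "}" ::"r"(addr)
      : "memory");
}
```

With these, the fix under discussion amounts to issuing wgmma_commit_group() after the wgmma.mma_async instructions for a tile and wgmma_wait_group<0>() before war_arrive(...), so the operand buffers cannot be released while a wgmma is still reading them.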

@zasdfgbnm
Collaborator Author

cc: @naoyam @rdspring1 @jacobhinkle

@jacobhinkle
Collaborator

I assume compute-sanitizer doesn't find a race in this case?

@zasdfgbnm
Collaborator Author

> I assume compute-sanitizer doesn't find a race in this case?

We do have a sanitizer pipeline, and nothing failed. So I believe the timing just happens to avoid the race even without the sync.

@jacobhinkle
Collaborator

Relevant docs (a sketch of how these steps might map onto the kernel follows the excerpts):

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions

The wgmma instructions perform warpgroup level matrix multiply-and-accumulate operation by having all threads in a warpgroup collectively perform the following actions:

  1. Load matrices A, B and D into registers or into shared memory.
  2. Perform the following fence operations:
    - wgmma.fence operations to indicate that the register/shared-memory across the warpgroup have been written into.
    - fence.proxy.async operation to make the generic proxy operations visible to the async proxy.
  3. Issue the asynchronous matrix multiply and accumulate operations using the wgmma.mma_async operation on the input matrices. The wgmma.mma_async operation is performed in the async proxy.
  4. Create a wgmma-group and commit all the prior outstanding wgmma.mma_async operations into the group, by using wgmma.commit_group operation.
  5. Wait for the completion of the required wgmma-group.
  6. Once the wgmma-group completes, all the wgmma.mma_async operations have been performed and completed.

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions-wgmma-fence

The wgmma.fence instruction must be issued by all warps of the warpgroup at the following locations:

  • Before the first wgmma.mma_async operation in a warpgroup.
  • Between a register access by a thread in the warpgroup and any wgmma.mma_async instruction that accesses the same registers, either as accumulator or input register containing fragments of matrix A, except when these are accumulator register accesses across multiple wgmma.mma_async instructions of the same shape. In the latter case, an ordering guarantee is provided by default.

Otherwise, the behavior is undefined.
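
To make the documented sequence concrete, here is a hedged sketch of how steps 1–6 might look in a consumer warpgroup's main loop (requires sm_90a; the wgmma.mma_async shapes, operand descriptors, and accumulator registers are elided, and the function names are illustrative rather than nvFuser's generated code):

```cuda
// wgmma.fence: registers/shared memory across the warpgroup have been written
// (step 2a in the excerpt above).
__device__ inline void wgmma_fence() {
  asm volatile("wgmma.fence.sync.aligned;" ::: "memory");
}

// fence.proxy.async: make generic-proxy writes visible to the async proxy
// (step 2b).
__device__ inline void fence_proxy_async() {
  asm volatile("fence.proxy.async;" ::: "memory");
}

// Consumer (compute) warpgroup main loop, keyed to the steps in the excerpt.
__device__ void consumer_tile_loop_sketch(int num_tiles) {
  for (int tile = 0; tile < num_tiles; ++tile) {
    // Step 1: operands A/B for this tile were staged in shared memory by the
    //         load warps (e.g. via TMA); the D accumulator lives in registers.
    wgmma_fence();        // Step 2a
    fence_proxy_async();  // Step 2b
    // Step 3: issue wgmma.mma_async on the staged operands (elided here).
    asm volatile("wgmma.commit_group.sync.aligned;" ::: "memory");  // Step 4
    asm volatile("wgmma.wait_group.sync.aligned 0;" ::: "memory");  // Step 5
    // Step 6: the wgmma-group has completed; only now is it safe to issue the
    //         WAR mbarrier arrive so the load warps can reuse the buffers.
  }
}
```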
