
Missing sync in matmul kernel #3561

Open
zasdfgbnm opened this issue Dec 10, 2024 · 4 comments · May be fixed by #3563 or #3573

@zasdfgbnm
Collaborator

I just noticed that we are missing the wgmma.commit_group and wgmma.wait_group before the WAR (write-after-read) mbarrier arrive in the compute warp.

I have no idea why we are not seeing any wrong results.

Manually adding the sync back does not seem to hurt perf, as experimented in #3560.
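
For reference, a minimal sketch (not nvFuser's actual codegen) of the pieces involved, assuming an mbarrier-based WAR handshake between the load and compute warps; the helper names (wgmma_commit_group, wgmma_wait_group, war_arrive) are illustrative:

```cuda
#include <cstdint>

// Commit all prior outstanding wgmma.mma_async operations into a wgmma-group.
__device__ inline void wgmma_commit_group() {
  asm volatile("wgmma.commit_group.sync.aligned;" ::: "memory");
}

// Block until at most kPending committed wgmma-groups are still in flight.
template <int kPending>
__device__ inline void wgmma_wait_group() {
  asm volatile("wgmma.wait_group.sync.aligned %0;" ::"n"(kPending) : "memory");
}

// WAR arrive: signal the producer (load) warps that this consumer warpgroup
// is done reading the shared-memory operand buffers, so they may be reused.
// The mbarrier-based handshake here is an assumption for illustration.
__device__ inline void war_arrive(uint64_t* smem_mbarrier) {
  uint32_t addr =
      static_cast<uint32_t>(__cvta_generic_to_shared(smem_mbarrier));
  asm volatile(
      "{\n"
      ".reg .b64 state;\n"
      "mbarrier.arrive.shared::cta.b64 state, [%0];\n"
      "}" ::"r"(addr)
      : "memory");
}
```

With these, the fix under discussion amounts to issuing wgmma_commit_group() after the wgmma.mma_async instructions for a tile and wgmma_wait_group<0>() before war_arrive(...), so the operand buffers cannot be released while a wgmma is still reading them.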

@zasdfgbnm
Collaborator Author

cc: @naoyam @rdspring1 @jacobhinkle

@jacobhinkle
Collaborator

I assume compute-sanitizer doesn't find a race in this case?

@zasdfgbnm
Collaborator Author

> I assume compute-sanitizer doesn't find a race in this case?

We do have a sanitizer pipeline, and nothing failed. So I believe the timing just happens to avoid the race even without the sync.

@jacobhinkle
Collaborator

Relevant docs (a sketch of how these steps might map onto the kernel follows the excerpts):

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions

The wgmma instructions perform warpgroup level matrix multiply-and-accumulate operation by having all threads in a warpgroup collectively perform the following actions:

  1. Load matrices A, B and D into registers or into shared memory.
  2. Perform the following fence operations:
    - wgmma.fence operations to indicate that the register/shared-memory across the warpgroup have been written into.
    - fence.proxy.async operation to make the generic proxy operations visible to the async proxy.
  3. Issue the asynchronous matrix multiply and accumulate operations using the wgmma.mma_async operation on the input matrices. The wgmma.mma_async operation is performed in the async proxy.
  4. Create a wgmma-group and commit all the prior outstanding wgmma.mma_async operations into the group, by using wgmma.commit_group operation.
  5. Wait for the completion of the required wgmma-group.
  6. Once the wgmma-group completes, all the wgmma.mma_async operations have been performed and completed.

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions-wgmma-fence

The wgmma.fence instruction must be issued by all warps of the warpgroup at the following locations:

  • Before the first wgmma.mma_async operation in a warpgroup.
  • Between a register access by a thread in the warpgroup and any wgmma.mma_async instruction that accesses the same registers, either as accumulator or input register containing fragments of matrix A, except when these are accumulator register accesses across multiple wgmma.mma_async instructions of the same shape. In the latter case, an ordering guarantee is provided by default.

Otherwise, the behavior is undefined.
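
To make the documented sequence concrete, here is a hedged sketch of how steps 1–6 might look in a consumer warpgroup's main loop (requires sm_90a; the wgmma.mma_async shapes, operand descriptors, and accumulator registers are elided, and the function names are illustrative rather than nvFuser's generated code):

```cuda
// wgmma.fence: registers/shared memory across the warpgroup have been written
// (step 2a in the excerpt above).
__device__ inline void wgmma_fence() {
  asm volatile("wgmma.fence.sync.aligned;" ::: "memory");
}

// fence.proxy.async: make generic-proxy writes visible to the async proxy
// (step 2b).
__device__ inline void fence_proxy_async() {
  asm volatile("fence.proxy.async;" ::: "memory");
}

// Consumer (compute) warpgroup main loop, keyed to the steps in the excerpt.
__device__ void consumer_tile_loop_sketch(int num_tiles) {
  for (int tile = 0; tile < num_tiles; ++tile) {
    // Step 1: operands A/B for this tile were staged in shared memory by the
    //         load warps (e.g. via TMA); the D accumulator lives in registers.
    wgmma_fence();        // Step 2a
    fence_proxy_async();  // Step 2b
    // Step 3: issue wgmma.mma_async on the staged operands (elided here).
    asm volatile("wgmma.commit_group.sync.aligned;" ::: "memory");  // Step 4
    asm volatile("wgmma.wait_group.sync.aligned 0;" ::: "memory");  // Step 5
    // Step 6: the wgmma-group has completed; only now is it safe to issue the
    //         WAR mbarrier arrive so the load warps can reuse the buffers.
  }
}
```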
