Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wgmma] Insert commit_group and wait_group after mma_async #3573

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

jacobhinkle
Copy link
Collaborator

@jacobhinkle jacobhinkle commented Dec 11, 2024

We need to wait for the MMA to complete before arriving at the circular buffer mbarrier, to avoid overwriting the inputs while the wgmma is being computed. This PR inserts a commit and wait just after the wgmma, which makes the wgmma synchronous.

Note that so far I have not removed the WAR sync insertion for wgmma which is no longer needed.

Fixes #3561.

@jacobhinkle
Copy link
Collaborator Author

!test

@rdspring1
Copy link
Collaborator

@jacobhinkle I ran into #3561 when working with register sharing and warp specialization. I wonder if I can use this.

Copy link
Collaborator

@rdspring1 rdspring1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can be multiple wgmma operations per circular buffer stage. e.g., when the k dimension is a multiple of mma macro.

It should be safe to wait for all operations per stage.

I made the following change in WarAsyncWaitInserter.
https://github.com/NVIDIA/Fuser/pull/3616/files#diff-49bea61a8cde014ec0396c89d0654813136e0bd3ffe8b3a5974ee9ccf3a5fbb8R1010-R1024

It didn't any performance difference. 🤷🏼

csrc/device_lower/pass/inline_ptx.cpp Outdated Show resolved Hide resolved
@jacobhinkle jacobhinkle marked this pull request as ready for review January 13, 2025 13:33
@jacobhinkle
Copy link
Collaborator Author

!test

@jacobhinkle
Copy link
Collaborator Author

!test

@jacobhinkle jacobhinkle marked this pull request as draft January 13, 2025 15:38
@jacobhinkle
Copy link
Collaborator Author

!test

@jacobhinkle
Copy link
Collaborator Author

Looks like the MmaTest/HopperRS.SingleTile/Hopper*NoSwizzle* tests are failing due to invalid reads from smem.

@jacobhinkle
Copy link
Collaborator Author

Looks like the MmaTest/HopperRS.SingleTile/Hopper*NoSwizzle* tests are failing due to invalid reads from smem.

These are failing the compute-sanitizer check. This happens because we have inserted the WAR sync twice. This PR adds the sync right after the wgmma, essentially making it synchronous. That means there is no need to also add a WAR sync for the wgmma.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing sync in matmul kernel
2 participants