[wgmma] Insert commit_group and wait_group after mma_async #3573
base: main
!test
@jacobhinkle I ran into #3561 when working with register sharing and warp specialization. I wonder if I can use this.
There can be multiple wgmma operations per circular buffer stage, e.g., when the k dimension is a multiple of the mma macro.
It should be safe to wait for all operations per stage.
I made the following change in WarAsyncWaitInserter.
https://github.com/NVIDIA/Fuser/pull/3616/files#diff-49bea61a8cde014ec0396c89d0654813136e0bd3ffe8b3a5974ee9ccf3a5fbb8R1010-R1024
It didn't make any performance difference. 🤷🏼
Co-authored-by: Ryan Spring <[email protected]>
!test |
!test |
!test |
Looks like the |
These are failing the |
We need to wait for the MMA to complete before arriving at the circular buffer mbarrier, to avoid overwriting the inputs while the wgmma is being computed. This PR inserts a commit and wait just after the wgmma, which makes the wgmma synchronous.
Note that so far I have not removed the WAR sync insertion for wgmma, which is no longer needed now that the wgmma is synchronous.
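To illustrate the hazard described above, here is a minimal host-side C++ toy model. It is not nvFuser code and not device code: `AsyncMmaQueue` and `runStage` are hypothetical names invented for this sketch. The model assumes the worst-case ordering, where an async op only executes when something waits on its group, and it mimics the PTX semantics that a commit batches all uncommitted ops into one group and that waiting with N leaves at most N groups pending.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical toy model of an async MMA pipeline (host-side only).
class AsyncMmaQueue {
 public:
  // Models mma_async: the op is recorded but has not run yet.
  void mmaAsync(std::function<void()> op) {
    uncommitted_.push_back(std::move(op));
  }

  // Models commit_group: all uncommitted ops become one group.
  void commitGroup() {
    groups_.push_back(std::move(uncommitted_));
    uncommitted_.clear();
  }

  // Models wait_group N: complete the oldest groups until at most
  // max_pending groups remain in flight.
  void waitGroup(std::size_t max_pending) {
    while (groups_.size() > max_pending) {
      for (auto& op : groups_.front()) {
        op();
      }
      groups_.pop_front();
    }
  }

  std::size_t pendingGroups() const { return groups_.size(); }

 private:
  std::vector<std::function<void()>> uncommitted_;
  std::deque<std::vector<std::function<void()>>> groups_;
};

// Runs one circular buffer stage with num_mmas async ops that all read
// stage_input. If wait_before_arrive is true, the ops are committed and
// waited on before the slot is released (the behavior this PR adds).
int runStage(bool wait_before_arrive, int num_mmas) {
  AsyncMmaQueue queue;
  int stage_input = 1;
  int accumulator = 0;

  for (int i = 0; i < num_mmas; ++i) {
    queue.mmaAsync([&] { accumulator += stage_input; });
  }
  queue.commitGroup();
  if (wait_before_arrive) {
    queue.waitGroup(0);  // every op in the stage has read its input
  }

  // "Arrive" at the barrier: the producer may now overwrite the slot.
  stage_input = 100;

  // A wait placed after the arrive is too late to protect the input.
  queue.waitGroup(0);
  return accumulator;
}
```

In this model, `runStage(true, 4)` accumulates the original input four times, while `runStage(false, 4)` reads the overwritten value instead. A single wait with 0 pending groups covers every op issued in the stage, which matches the observation in the review thread that it is safe to wait for all operations per stage.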
Fixes #3561.