[wgmma] Insert commit_group and wait_group after mma_async #3573

jacobhinkle · 2024-12-11T14:21:39Z

We need to wait for the MMA to complete before arriving at the circular buffer mbarrier, to avoid overwriting the inputs while the wgmma is being computed. This PR inserts a commit and wait just after the wgmma, which makes the wgmma synchronous.
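The ordering described above can be sketched with a toy model. This is only an illustration, not nvFuser code: the `Op` enum and `insertWaitsAfterMma` are hypothetical names, and the real change lives in the inline PTX lowering pass. The sketch assumes the wait is a `wgmma.wait_group` with operand 0, meaning no committed group may remain pending.

```cpp
#include <vector>

// Hypothetical, simplified stand-in for lowered kernel operations.
// These are NOT nvFuser IR nodes; they only model instruction ordering.
enum class Op { LoadStage, MmaAsync, CommitGroup, WaitGroup, ArriveMbarrier };

// Insert a commit_group and a wait_group directly after every mma_async.
// With the wait placed here, the wgmma has finished reading its operands
// before any later op (such as the mbarrier arrive that releases the
// circular buffer slot) is reached.
std::vector<Op> insertWaitsAfterMma(const std::vector<Op>& ops) {
  std::vector<Op> out;
  for (Op op : ops) {
    out.push_back(op);
    if (op == Op::MmaAsync) {
      out.push_back(Op::CommitGroup);  // models wgmma.commit_group
      out.push_back(Op::WaitGroup);    // models wgmma.wait_group 0
    }
  }
  return out;
}
```

Applied to the sequence `LoadStage, MmaAsync, ArriveMbarrier`, the model yields `LoadStage, MmaAsync, CommitGroup, WaitGroup, ArriveMbarrier`, so the arrive can no longer be reached while the wgmma is still in flight.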

Note that I have not yet removed the WAR sync insertion for wgmma, which is no longer needed.

Fixes #3561.

jacobhinkle · 2024-12-11T17:35:42Z

!test

rdspring1 · 2025-01-03T02:00:01Z

@jacobhinkle I ran into #3561 when working with register sharing and warp specialization. I wonder if I can use this.

rdspring1

There can be multiple wgmma operations per circular buffer stage, e.g., when the K dimension is a multiple of the mma macro.

It should be safe to wait for all operations per stage.
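As a rough sketch of waiting once per stage: `wgmma.commit_group` batches all prior uncommitted `wgmma.mma_async` operations into one group, so a single commit followed by `wgmma.wait_group 0` covers every wgmma issued in the stage. The function name `emitStageWithSingleWait` and the string-based instruction list are hypothetical and are not taken from `WarAsyncWaitInserter`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of one circular-buffer stage that issues `num_mmas`
// wgmma operations. Instead of a commit/wait pair per wgmma, it emits a
// single commit and a single wait after the last one.
std::vector<std::string> emitStageWithSingleWait(std::size_t num_mmas) {
  std::vector<std::string> out;
  for (std::size_t i = 0; i < num_mmas; ++i) {
    out.push_back("wgmma.mma_async");
  }
  if (num_mmas > 0) {
    // One group now contains every mma_async issued above.
    out.push_back("wgmma.commit_group");
    // Operand 0: block until no committed group is still pending.
    out.push_back("wgmma.wait_group 0");
  }
  // Only now is it safe to release the stage's input buffers.
  out.push_back("mbarrier.arrive");
  return out;
}
```

For a stage with three wgmma operations this produces three `wgmma.mma_async` entries, then one commit, one wait, and the arrive.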

I made the following change in WarAsyncWaitInserter.
https://github.com/NVIDIA/Fuser/pull/3616/files#diff-49bea61a8cde014ec0396c89d0654813136e0bd3ffe8b3a5974ee9ccf3a5fbb8R1010-R1024

It didn't make any performance difference. 🤷🏼

csrc/device_lower/pass/inline_ptx.cpp

Co-authored-by: Ryan Spring <[email protected]>

jacobhinkle · 2025-01-13T13:33:51Z

!test

jacobhinkle · 2025-01-13T14:10:00Z

!test

jacobhinkle · 2025-01-13T15:41:23Z

!test

jacobhinkle · 2025-01-13T17:25:59Z

Looks like the MmaTest/HopperRS.SingleTile/Hopper*NoSwizzle* tests are failing due to invalid reads from smem.

jacobhinkle · 2025-01-16T13:41:45Z

> Looks like the MmaTest/HopperRS.SingleTile/Hopper*NoSwizzle* tests are failing due to invalid reads from smem.

These are failing the compute-sanitizer check. This happens because we have inserted the WAR sync twice. This PR adds the sync right after the wgmma, essentially making it synchronous. That means there is no need to also add a WAR sync for the wgmma.

[wgmma] Insert commit_group and wait_group after mma_async

e6ea681

jacobhinkle requested a review from zasdfgbnm December 11, 2024 14:21

jacobhinkle added the Matmuls label Dec 11, 2024

rdspring1 reviewed Jan 3, 2025

jacobhinkle and others added 2 commits January 13, 2025 08:32

Update csrc/device_lower/pass/inline_ptx.cpp

16c7440

Co-authored-by: Ryan Spring <[email protected]>

Insert comment

3c259f6

jacobhinkle marked this pull request as ready for review January 13, 2025 13:33

jacobhinkle added 3 commits January 13, 2025 08:59

Merge remote-tracking branch 'origin/main' into wgmma_insert_waits

637d372

Merge remote-tracking branch 'origin/main' into wgmma_insert_waits

fb804cd

Fix compile error

cc58fc6

jacobhinkle marked this pull request as draft January 13, 2025 15:38

s/min/max/

085a76c
