
Create ElectSync predicate type #2923

Merged: 7 commits into main from elect_sync (Sep 18, 2024)
Conversation

@rdspring1 (Collaborator) commented Sep 9, 2024

Summary

This PR creates the ElectSync predicate type, which selects a single thread to execute an if-then-else block.

Why?

The standard predicate threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0 selects a specific thread from a CTA, which can create a peeling loop. The elect.sync PTX instruction instead selects an arbitrary thread from a given warp.

Lowering Details

  1. Create the ElectSync PredicateType.
  2. Add ElectSync as a unary op.
  3. Enforce that blockDim.x contains at least one full warp (32 threads), because the default membermask is 0xFFFFFFFF (a minimal sketch of this check follows below).
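A minimal sketch of the check in item 3, written as a hypothetical standalone helper rather than NvFuser's actual lowering or validation code:

#include <cstdint>
#include <stdexcept>

// The default membermask 0xFFFFFFFF assumes all 32 lanes of a warp
// participate in elect.sync, so the launch must provide at least one
// full warp along TIDx.
void validateElectSyncLaunch(int64_t bdimx) {
  constexpr int64_t kWarpSize = 32;
  if (bdimx < kWarpSize) {
    throw std::runtime_error(
        "ElectSync predicate requires blockDim.x >= 32 (one full warp)");
  }
}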

NvFuser's ElectSync Predicate

int warp_idx_zero = threadIdx.x < 32 && threadIdx.y == 0 && threadIdx.z == 0;
int lane_predicate = hopper::electSync(/*membermask=*/0xFFFFFFFF);
if (warp_idx_zero && lane_predicate) {
  // Do Something
}
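For reference, a helper like hopper::electSync above is typically a thin wrapper around the elect.sync PTX instruction (Hopper, sm_90+). Below is a minimal sketch in the spirit of CUTLASS's elect_one_sync; it is not necessarily NvFuser's actual runtime implementation:

__device__ inline unsigned electSync(unsigned membermask) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  unsigned pred = 0;
  unsigned laneid = 0;
  // elect.sync writes the elected lane id to %rx and sets the predicate %px
  // only in that lane; every thread in membermask must execute it.
  asm volatile(
      "{\n"
      ".reg .b32 %%rx;\n"
      ".reg .pred %%px;\n"
      "     elect.sync %%rx|%%px, %2;\n"
      "@%%px mov.s32 %1, 1;\n"
      "     mov.s32 %0, %%rx;\n"
      "}\n"
      : "+r"(laneid), "+r"(pred)
      : "r"(membermask));
  return pred;  // 1 only in the single elected lane
#else
  return (threadIdx.x % 32) == 0;  // fallback: pick lane 0 on older archs
#endif
}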

Note: It could be more efficient to use __shfl_sync to compute warp_idx, as CUTLASS does.

How CUTLASS selects a leader thread to issue TMA instructions

int warp_idx = __shfl_sync(0xffffffff, threadIdx.x / 32, 0);
int lane_predicate = hopper::electSync(/*membermask=*/0xFFFFFFFF);

if ((warp_idx == 0) && lane_predicate) {
  // Issue Tma Descriptor Prefetch from a single thread
}

@rdspring1 marked this pull request as ready for review on September 11, 2024 16:55
@naoyam (Collaborator) commented Sep 11, 2024

Why not exactly follow what CUTLASS does?

@naoyam (Collaborator) commented Sep 11, 2024

I'd drop Sync from the name. It implies warp-synchronous instruction execution. Here, it's just electing one thread from a block. Maybe ElectThread?

I don't think it's necessary at this moment, but to make it more future-proof, we may also want to consider how this predicate could be extended to support electing one execution entity within each parallel type. Again, just a thought, not important right now.

Resolved review thread on csrc/device_lower/utils.h (outdated)
@rdspring1 (Collaborator, Author) commented:

> Why not exactly follow what CUTLASS does?

CUTLASS already knows which dimensions of the block are zero, so it can skip the extra predicates.

> consider how this predicate could be extended to support electing one execution entity within each parallel type.

I agree. CUTLASS does select warps based on threadIdx.x and threadIdx.y.

> Change name to ElectThread?

I don't have a strong opinion about changing the name to ElectThread. However, all threads within the member-mask are synchronized, so there is some synchronization with this PTX instruction.

> The mandatory .sync qualifier indicates that elect causes the executing thread to wait until all threads in the member-mask execute the elect instruction before resuming execution.

Reference: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-elect-sync

@zasdfgbnm (Collaborator) commented Sep 11, 2024

I think my biggest concern is that I don't think ElectSync is a separate predicate type. I would rather consider it an optimization on existing predicate types. For example, if BDIMx is 32 and there is an IterDomain I0{5} parallelized on TIDx, then we will generate the predicate threadIdx.x < 5, which cannot be optimized as elect sync. Similarly, if I0's extent is 1 instead of 5, then we will have threadIdx.x < 1, which can be optimized as elect sync.
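(For illustration, a hedged sketch of the two generated predicates described above; the bodies are placeholders, not actual lowered NvFuser output:)

// Extent 5 on TIDx: five threads each do part of the work, so this
// predicate cannot be rewritten as an elect.sync predicate.
if (threadIdx.x < 5) {
  // work for threads 0..4
}

// Extent 1 on TIDx: exactly one thread does the work, so the predicate
// is equivalent to electing a single thread and could be optimized.
if (threadIdx.x < 1) {
  // work for thread 0 only
}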

The reason I dislike having ElectSync as a separate predicate type is that it effectively skips all our existing analyses on predication and just hard-codes "select one thread", which would generate silently wrong results if the schedule does not imply "only one thread does the work". In the above example, if the extent of I0 is 5, then using the elect sync predicate type is just silently wrong.

I believe a better way to do this is: run through all the existing predicate analyses we already have in our system. If these analyses indicate that only one thread is selected, then optimize it as elect sync.

What do you think? @naoyam

@naoyam (Collaborator) commented Sep 11, 2024

I generally agree with @zasdfgbnm. Ideally, we should automatically use elect_sync whenever possible. We have "thread predicates", which indicate which threads should be allowed to execute. Maybe we could use that to pipe it through.

That said, I think at this point we should prioritize getting things done as quickly as possible, so I'd prefer to have something working first and then reconsider the design.

@naoyam (Collaborator) commented Sep 11, 2024

> I don't have a strong opinion about changing the name to ElectThread. However, all threads within the member-mask are synchronized, so there is some synchronization with this PTX instruction.

That's true. Actually, I don't think this is worth worrying about too much. It's just a Kernel IR type, not exposed to the user, so never mind.

@zasdfgbnm (Collaborator) commented Sep 11, 2024

> That said, I think at this point we should prioritize getting things done as quickly as possible, so I'd prefer to have something working first and then reconsider the design.

I agree with this, but I am still uncomfortable with silently wrong results.

Today, TMA, just like any other expression in nvFuser, fully supports being launched by multiple threads. It will generate predicates like if (threadIdx.x < 5) if the TIDx-parallelized ID has an extent of 5. This behavior is well documented in tma.md and test_tutorial.cpp.

But suddenly we introduced this arbitrary limitation that does not match what has been documented and could produce silently wrong results. In the long term, I believe we should have as few special cases as possible, which means TMA should continue to support being launched by multiple threads, and an arbitrary threadIdx.x < 1 should be capable of being optimized as elect sync. I think achieving this does not need extra coding work, so the final PR should be as short as it is today, but it does require extra work studying how our existing predicate generation works.

In the short term, I am not suggesting that this is the most important thing, but I still believe that at least what is documented should match what is implemented, which should also match what is validated. That means I don't think we can just add a new predicate type and use it. We should at least add some validation that the schedule is compatible with this predicate type, and update the documentation about the limitation.

@rdspring1 (Collaborator, Author) commented:

> We should at least add some validation that the schedule is compatible with this predicate type, and update the documentation about the limitation.

We can add a check so that we're not launching multiple threads in any block-parallel dimension.

@rdspring1 (Collaborator, Author) commented Sep 11, 2024

Since TMA is an async operation, I do wonder if

if (threadIdx.x < 5) {
  // launch TMA simultaneously
}

is better than

for (size_t i : irange(5)) {
  if (warp_idx == 0 && electSync()) {
    // launch tma individually
  }
}

Or vice versa? Probably not much difference.

This ^^^ isn't related to the PR. Just general curiosity.

@naoyam (Collaborator) commented Sep 11, 2024

I just realized this was extracted from #2833, which I haven't yet looked at. This PR itself doesn't have any use of the new predicate type, so it's hard to discuss how safe it would be. Let me review #2833 first.

@zasdfgbnm (Collaborator) commented:

> Since TMA is an async operation, I do wonder if launching TMA simultaneously from multiple threads (if (threadIdx.x < 5) { ... }) is better than launching it individually from one elected thread inside a loop (if (warp_idx == 0 && electSync()) { ... } within for (size_t i : irange(5))). Or vice versa? Probably not much difference.
>
> This ^^^ isn't related to the PR. Just general curiosity.

Honestly, I don't know. I don't think there will be any first-order difference, but I won't be surprised if there are some second-order effects. For example, one variant might use more registers than the other, and the extra registers could bring the occupancy down from 2 to 1. I just want the flexibility to easily experiment with different things.

@zasdfgbnm (Collaborator) commented:

> We can add a check so that we're not launching multiple threads in any block-parallel dimension.

Yeah, adding a check is sufficient for now. Please also add a warning message to the doc mentioning that if you do circular buffering, you must launch TMA using one thread.

@rdspring1 (Collaborator, Author) commented:

!build

@rdspring1 (Collaborator, Author) commented Sep 16, 2024

I added the compatibility check for the ElectSync predicate.

Summary: In the predicate lowering pass, for any If-Then-Else expression with an ElectSync predicate, check that none of its output TensorViews has any block parallelization.

There isn't an expression that uses the ElectSync predicate in this PR, so I added the test to #2833. See TEST_F(NVFuserTest, ElectSyncCompatibility) in f29aa22.
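A hedged sketch of the compatibility check described above; the helper names (getOutputTvs, hasBlockParallelId) and the use of NVF_ERROR are illustrative assumptions, and the actual implementation lives in csrc/device_lower/pass/predicate.cpp:

// Illustrative only: reject an ElectSync-predicated If-Then-Else whose outputs
// are block parallelized, since electing a single thread would silently drop
// the work expected from the other threads.
void validateElectSyncCompatibility(const kir::IfThenElse* ite) {
  if (ite->predicate()->predicate_type() != PredicateType::ElectSync) {
    return;
  }
  for (const Expr* expr : ite->thenBody().exprs()) {
    for (const TensorView* tv : getOutputTvs(expr)) {  // assumed helper
      NVF_ERROR(
          !hasBlockParallelId(tv),                     // assumed helper
          "ElectSync predicate requires outputs without block (TIDx/y/z) parallelization: ",
          tv->toString());
    }
  }
}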

@@ -349,6 +349,11 @@ the TMA domain can be completely inferred from the schedule.
> We do not have validation on shared memory schedule yet.
> If you scheduled something invalid, likely you will see misaligned address error or silent wrong result.

> [!WARNING]
@rdspring1 (Collaborator, Author) commented on the diff above:

Perhaps there should be a circular buffering section, but I added the warning here for now.

Resolved review threads on csrc/device_lower/pass/predicate.cpp and csrc/fusion_executor/executor.cpp (outdated)
@rdspring1 (Collaborator, Author) commented:

!build

@rdspring1 merged commit 21ab9ab into main on Sep 18, 2024
34 of 36 checks passed
@rdspring1 deleted the elect_sync branch on September 18, 2024 16:13