Use loop promotion and indexing traversal graph to find mismatched parallelization #2875

naoyam · 2024-08-30T17:07:58Z

See #2850
Stacked on #2901

The old code is still used by default. With NVFUSER_ENABLE=id_model, the new analysis is used. It's also used for tensors with non-conventional domains.

This is required for #2851. It also enables the previously disabled parallelization of the mismatching reshape test from #2684.

I validated the change by comparing the results between the existing and new analyses with all the tests and benchmarks. The only mismatch was with the mismatching reshape test, for which the existing analysis declared a sync is required, whereas the new one correctly recognizes there's no cross-thread dependency.

naoyam · 2024-08-30T17:23:59Z

!build --diff

naoyam · 2024-08-30T19:05:48Z

!build --diff

naoyam · 2024-08-30T23:37:47Z

!build --diff-bench

naoyam · 2024-08-31T04:06:21Z

!build --diff-bench

naoyam · 2024-09-01T21:12:55Z

!build --pybench

naoyam · 2024-09-03T20:32:49Z

@zasdfgbnm Could you take a look and let me know what you think? I refactored the sync analysis using the loop promotion results. For now, to make it easier to verify the results, both the existing and new analyses are run and their results are compared. If they disagree, that's a signal we should look into, so an exception is thrown.
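To illustrate the cross-check described above, here is a minimal, self-contained sketch. It is not the code in `sync_information.cpp`: `ParallelTypeBits` and `crossCheck` are hypothetical names, and the bitset is only a stand-in for the per-tensor set of parallel types that need a RAW sync.

```cpp
#include <bitset>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the set of parallel types that require a RAW
// sync for one tensor (one bit per parallel type).
using ParallelTypeBits = std::bitset<6>;

// Returns the result of the new analysis after confirming that it agrees
// with the legacy one. A disagreement is reported by throwing, so that a
// mismatch cannot go unnoticed in tests or benchmarks.
ParallelTypeBits crossCheck(
    const std::string& tensor_name,
    const ParallelTypeBits& legacy_result,
    const ParallelTypeBits& new_result) {
  if (legacy_result != new_result) {
    std::ostringstream msg;
    msg << "Sync analysis mismatch for " << tensor_name
        << ": legacy=" << legacy_result << ", new=" << new_result;
    throw std::runtime_error(msg.str());
  }
  return new_result;
}
```

Throwing rather than logging makes every disagreement fail loudly, which fits a temporary validation mode better than a production code path.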

To do the refactoring, I needed a BestEffortReplay that's truly "best effort". The current version checks that the root-to-logical expressions of a consumer tensor always appear in the corresponding producer tensor. I made that check optional because all we need is to find out whether a producer ID and a consumer ID are derived from mapped root/logical domains with mapped exprs. Alternatively, we could use the "broadcast" graph, which would map broadcast and non-broadcast domains in addition to what the Exact graph maps.

I also enabled the mismatching reshape test. The new analysis was able to find that there's no dependency.

Let me know what you think. If it makes sense, I'll prepare a real PR for review.

naoyam · 2024-09-03T20:34:45Z

!build --pybench

zasdfgbnm · 2024-09-03T22:49:37Z

csrc/device_lower/analysis/sync_information.cpp

+              if (auto it = p2c_map_no_forwarding.find(p_id);
+                  it != p2c_map_no_forwarding.end() && it->second == c_id) {
+                requires_sync = false;


Why do we need this branch? Isn't the second branch itself sufficient?

That was my initial thought, but it's actually not enough since the AlmostExact graph doesn't map broadcast and non-broadcast domains. For example, just a simple fusion like this would fail:

// t0: [i0]
// t1: [b1]
t2 = t0 + t1
// t2: [i2]

Suppose nothing is inlined, so there's no promotion. Since b1 and i2 are not mapped, this would result in requiring a synchronization, which isn't the case. t2 and t1 use different indices but no sync is needed.

That's why I also mentioned the broadcast graph. If it were used instead of the AlmostExact graph, that would be sufficient as well.
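As a toy illustration of the difference between the two graphs for the `t2 = t0 + t1` example, the sketch below models an ID graph as a disjoint set. `ToyIdGraph`, `buildExactLike`, and `buildBroadcastLike` are hypothetical and unrelated to the real IdModel classes. The sketch assumes that `i0` and `i2` are mapped in both graphs and that only the broadcast-style graph additionally maps `b1` to `i2`.

```cpp
#include <numeric>
#include <vector>

// Minimal disjoint-set over integer IDs; a toy stand-in for an ID graph.
class ToyIdGraph {
 public:
  explicit ToyIdGraph(int n) : parent_(n) {
    std::iota(parent_.begin(), parent_.end(), 0);
  }

  // Find the representative of x, halving the path along the way.
  int find(int x) {
    while (parent_[x] != x) {
      parent_[x] = parent_[parent_[x]];
      x = parent_[x];
    }
    return x;
  }

  // Put a and b in the same group.
  void map(int a, int b) {
    parent_[find(a)] = find(b);
  }

  bool mapped(int a, int b) {
    return find(a) == find(b);
  }

 private:
  std::vector<int> parent_;
};

// IDs of the example fusion: t0: [i0], t1: [b1], t2: [i2].
enum ToyId { i0 = 0, b1 = 1, i2 = 2 };

// Exact-style graph: the broadcast b1 is not mapped to i2 (assumption:
// i0 and i2 are mapped).
ToyIdGraph buildExactLike() {
  ToyIdGraph g(3);
  g.map(i0, i2);
  return g;
}

// Broadcast-style graph: additionally maps the broadcast b1 to i2.
ToyIdGraph buildBroadcastLike() {
  ToyIdGraph g = buildExactLike();
  g.map(b1, i2);
  return g;
}
```

With the exact-style graph, a check based only on graph mapping would treat `b1` and `i2` as mismatched, which is why the extra producer-to-consumer lookup is needed there.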

I think in this case BestEffortReplay would be good enough.

If t1 is on smem, and both b1 and i2 are parallelized on TIDx, then we wouldn't need sync? For this case will 1 thread write smem, or all threads write it? Whichever it is, don't we need to wait until all writes are complete before reading?

In that case, TIDx would be marked as a redundant parallel type and only one thread should write to smem. You're right that a sync would be required, which should be taken care of around line 580. The way it's handled doesn't seem ideal to me, but I'm trying to make an incremental improvement.

zasdfgbnm · 2024-09-04T17:26:28Z

csrc/device_lower/analysis/sync_information.cpp

+                c_loop = consumer_loop_id;
+                const auto& indexing_traveral_graph =
+                    id_model.idGraph(IdMappingMode::ALMOSTEXACT);


nit: should this be tensor_indexer.traversalGraph()? I know they are the same today, but in the future, if we use IEL for indexing, should we use IEL or almost exact?

Yes. Right now it's also allowed to enable just IdModel without enabling the indexer, so tensor_indexer may not exist. I'll add a static function to TensorIndexer.

naoyam · 2024-09-04T19:44:29Z

csrc/device_lower/analysis/sync_information.cpp

@@ -492,7 +495,7 @@ SyncMap::SyncMap(Fusion* fusion) {
          producer_redundant_types & (~producer_redundant_use_types);

      for (const auto producer_i : c10::irange(producer->nDims())) {
-        auto producer_axis = producer->axis(producer_i);
+        auto producer_axis = producer->getLoopDomain().at(producer_i);


Not necessary, but just make it explicit that we are looking at the loop domain

naoyam · 2024-09-04T19:45:03Z

csrc/device_lower/analysis/sync_information.cpp

@@ -653,6 +673,7 @@ SyncMap::SyncMap(Fusion* fusion) {
                     producer->getLogicalDomain(), {p_id})
                     .empty()) {
              raw_dims.set(producer_ptype);
+              continue;


Not related, but I believe this was just missing

naoyam · 2024-09-04T19:52:34Z

!build

naoyam · 2024-09-05T04:54:29Z

!build

naoyam · 2024-09-05T08:50:28Z

@zasdfgbnm Could you please review again? It looks like my testing was not comprehensive and I missed a simple case where a sync is required. I added a test to reproduce it. Specifically, a mapping found by BestEffortReplay does not by itself mean that no synchronization is needed when a producer domain includes both a broadcast and a non-broadcast.

Case 1:
Producer loop ID: broadcast (which may be produced by merging multiple broadcast domains)
Consumer loop ID: non-broadcast
-> They are not exactly mapped but sync is not necessary as discussed above.

Case 2:
Producer loop ID: non-broadcast
Consumer loop ID: non-broadcast
-> No sync if they are exactly mapped. This case is covered by the promotion check.

Case 3:
Producer loop ID: non-broadcast
Consumer loop ID: non-broadcast
-> Sync required if they are not exactly mapped, even when they are mapped by the best effort replay. (See the new test for a simple repro)

The previous version missed case 3.
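The three cases can be summarized with the following sketch. It is a simplification, not the actual logic in `SyncMap`: `LoopIdPair` and `requiresSync` are hypothetical names, the consumer loop ID is assumed to be a non-broadcast, and other conditions handled elsewhere in the analysis (such as redundant parallel types) are left out.

```cpp
// Hypothetical summary of one producer/consumer loop ID pair.
struct LoopIdPair {
  bool producer_is_broadcast;        // producer loop ID is a broadcast
  bool exactly_mapped;               // IDs are exactly mapped (promotion check)
  bool mapped_by_best_effort_replay; // IDs are mapped by BestEffortReplay
};

// Returns true if the pair indicates a cross-thread RAW dependency.
bool requiresSync(const LoopIdPair& pair) {
  // Case 1: broadcast producer, non-broadcast consumer. The IDs are not
  // exactly mapped, but no sync is needed.
  if (pair.producer_is_broadcast) {
    return false;
  }
  // Case 2: both non-broadcast and exactly mapped.
  if (pair.exactly_mapped) {
    return false;
  }
  // Case 3: both non-broadcast and not exactly mapped. A sync is required
  // even if BestEffortReplay maps them, so that flag is deliberately
  // not consulted here.
  return true;
}
```

The `mapped_by_best_effort_replay` field is kept in the struct only to show that it does not change the outcome of case 3.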

I updated the code and did the validation by comparing the new and old analyses. See #2907 for the actual validation.

naoyam force-pushed the sync_info_idmodel branch from 590d3f4 to 30d3db5 Compare August 31, 2024 04:06

naoyam force-pushed the sync_info_idmodel branch from 4d646aa to dd1c039 Compare September 3, 2024 18:32

zasdfgbnm reviewed Sep 3, 2024

zasdfgbnm reviewed Sep 4, 2024

Allow silently ignore missing root-to-logical ops in BestEffortReplay

021d873

naoyam mentioned this pull request Sep 4, 2024

Allow silently ignore missing root-to-logical ops in BestEffortReplay #2901

Merged

naoyam added 7 commits September 4, 2024 12:14

Use loop promotion and indexing traversal graph to find mismatched parallelization

c335486

enable idmodel

84ff7a8

fix

1bff20d

format

1362221

cleanup

c5adea4

cleanup

e603106

format

1371a85

naoyam force-pushed the sync_info_idmodel branch from ab65811 to b852d75 Compare September 4, 2024 19:32

cleanup and enable the previously-failing reshape parallelization

72ae077

naoyam force-pushed the sync_info_idmodel branch from b852d75 to 72ae077 Compare September 4, 2024 19:43

naoyam changed the base branch from main to relaxed_best_effort_replay September 4, 2024 19:43

naoyam commented Sep 4, 2024

naoyam marked this pull request as ready for review September 4, 2024 19:52

naoyam requested a review from zasdfgbnm September 4, 2024 19:52

naoyam changed the title ~~[WIP] Use loop promotion and indexing traversal graph to find mismatched parallelization~~ Use loop promotion and indexing traversal graph to find mismatched parallelization Sep 4, 2024

zasdfgbnm approved these changes Sep 4, 2024

naoyam added a commit that referenced this pull request Sep 4, 2024

Allow silently ignore missing root-to-logical ops in BestEffortReplay (…

7be78f8

…#2901) Extracted from and required for #2875

Base automatically changed from relaxed_best_effort_replay to main September 4, 2024 20:26

naoyam added 2 commits September 4, 2024 21:53

fix

708555c

Merge branch 'main' into sync_info_idmodel

b686a7b

disable idmodel

aea0fae

naoyam requested a review from zasdfgbnm September 5, 2024 08:36

cleanup

5947d39

zasdfgbnm approved these changes Sep 5, 2024

naoyam merged commit 1158543 into main Sep 5, 2024
5 checks passed

naoyam deleted the sync_info_idmodel branch September 5, 2024 16:34

naoyam mentioned this pull request Sep 5, 2024

Refactor the RAW sync analysis with Idmodel #2850

Closed

naoyam added a commit that referenced this pull request Sep 5, 2024

Add slice tests to demonstrate manual scheduling (#2898)

0058da9

Slice and concat patterns without rotation. See #2851. Stacked on #2897, #2875 Closes #2870
