Use loop promotion and indexing traversal graph to find mismatched parallelization #2875

Merged
merged 13 commits into from
Sep 5, 2024

naoyam
Collaborator

@naoyam naoyam commented Aug 30, 2024

See #2850
Stacked on #2901

The old code is still used by default. With NVFUSER_ENABLE=id_model, the new analysis is used. It's also used for tensors with non-conventional domains.
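For readers unfamiliar with the opt-in, here is a minimal sketch of how a comma-separated enable list could be checked. The function names `hasEnableOption` and `idModelEnabled` are hypothetical, and the comma-separated format is an assumption; this is not nvFuser's actual option parser, and it ignores any option arguments.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>

// Hypothetical helper: returns true if `option` is one entry of a
// comma-separated list. `env_value` may be null, because std::getenv
// returns null for unset variables.
bool hasEnableOption(const char* env_value, const std::string& option) {
  if (env_value == nullptr) {
    return false;
  }
  std::istringstream stream(env_value);
  std::string entry;
  while (std::getline(stream, entry, ',')) {
    // Require an exact match so "id_model_extra" does not count.
    if (entry == option) {
      return true;
    }
  }
  return false;
}

// Hypothetical wrapper that reads the environment variable named above.
bool idModelEnabled() {
  return hasEnableOption(std::getenv("NVFUSER_ENABLE"), "id_model");
}
```

With something like this, running with `NVFUSER_ENABLE=id_model` in the environment would select the new analysis, while an unset variable keeps the old code path.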

This is required for #2851. It also enables the previously disabled parallelization of the mismatching reshape test from #2684.

I validated the change by comparing the results between the existing and new analyses with all the tests and benchmarks. The only mismatch was with the mismatching reshape test, for which the existing analysis declared a sync is required, whereas the new one correctly recognizes there's no cross-thread dependency.

@naoyam
Collaborator Author

naoyam commented Aug 30, 2024

!build --diff

@naoyam
Collaborator Author

naoyam commented Aug 30, 2024

!build --diff-bench

@naoyam naoyam force-pushed the sync_info_idmodel branch from 590d3f4 to 30d3db5 Compare August 31, 2024 04:06
@naoyam
Collaborator Author

naoyam commented Aug 31, 2024

!build --diff-bench

@naoyam
Collaborator Author

naoyam commented Sep 1, 2024

!build --pybench

@naoyam
Collaborator Author

naoyam commented Sep 3, 2024

@zasdfgbnm Could you take a look and let me know what you think? I refactored the sync analysis to use the loop promotion results. For now, to make the results easier to verify, both the existing and new analyses are run and their results are compared. If they don't agree, that's a signal we should look into, so an exception is thrown.
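As a rough illustration of the cross-check described above, here is a minimal standalone sketch. The types `ParallelBits` and `SyncResult` and the function `validateAgreement` are hypothetical stand-ins, not the types used in the actual change.

```cpp
#include <bitset>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins: one bit per parallel type that needs a sync,
// keyed by tensor name.
using ParallelBits = std::bitset<8>;
using SyncResult = std::map<std::string, ParallelBits>;

// A tensor missing from a result is treated as needing no sync.
ParallelBits lookup(const SyncResult& result, const std::string& tensor) {
  auto it = result.find(tensor);
  return it == result.end() ? ParallelBits() : it->second;
}

// Throws if the two analyses disagree on any tensor named by either one.
void validateAgreement(const SyncResult& legacy, const SyncResult& updated) {
  SyncResult all = legacy;
  all.insert(updated.begin(), updated.end());  // adds keys only in `updated`
  for (const auto& entry : all) {
    const std::string& tensor = entry.first;
    const ParallelBits a = lookup(legacy, tensor);
    const ParallelBits b = lookup(updated, tensor);
    if (a != b) {
      throw std::runtime_error(
          "Sync analyses disagree on " + tensor + ": legacy=" +
          a.to_string() + ", new=" + b.to_string());
    }
  }
}
```

Failing loudly on any disagreement keeps the old analysis as a reference while the new one is being validated.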

To do the refactoring, I needed a BestEffortReplay that's truly "best effort". The current version checks that the root-to-logical expressions of a consumer tensor always appear in the corresponding producer tensor. I made that check optional because all we need is to find out whether a producer ID and a consumer ID are derived from mapped root/logical domains through mapped exprs. Alternatively, we could use the "broadcast" graph, which would map broadcast and non-broadcast domains in addition to the Exact graph mappings.

I also enabled the mismatching reshape test. The new analysis was able to find that there's no dependency.

Let me know what you think. If it makes sense, I'll prepare a real PR for review.

@naoyam
Collaborator Author

naoyam commented Sep 3, 2024

!build --pybench

Comment on lines 694 to 697
if (auto it = p2c_map_no_forwarding.find(p_id);
it != p2c_map_no_forwarding.end() && it->second == c_id) {
requires_sync = false;
Collaborator

Why do we need this branch? Isn't the second branch itself sufficient?

Collaborator Author

That was my initial thought, but it's actually not enough since the AlmostExact graph doesn't map broadcast and non-broadcast domains. For example, just a simple fusion like this would fail:

// t0: [i0]
// t1: [b1]
t2 = t0 + t1
// t2: [i2]

Suppose nothing is inlined, so there's no promotion. Since b1 and i2 are not mapped, this would result in requiring a synchronization, which isn't the case. t2 and t1 use different indices but no sync is needed.

That's why I also mentioned the broadcast graph. If it were used instead of the AlmostExact graph, that would be sufficient as well.

I think in this case BestEffortReplay would be good enough.
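To make the broadcast example concrete, here is a toy model of the two branches as described in this reply. `GroupMap`, `ReplayMap`, `mappedInGraph`, and `requiresSyncTwoBranch` are hypothetical names; the maps only imitate the roles of the AlmostExact graph and the replay-based producer-to-consumer map, under the assumption that nothing is inlined.

```cpp
#include <map>
#include <string>

// Toy stand-in for an ID graph: IDs with the same group number are mapped.
using GroupMap = std::map<std::string, int>;
// Toy stand-in for a producer-to-consumer map built by a replay.
using ReplayMap = std::map<std::string, std::string>;

bool mappedInGraph(
    const GroupMap& groups,
    const std::string& a,
    const std::string& b) {
  auto ia = groups.find(a);
  auto ib = groups.find(b);
  return ia != groups.end() && ib != groups.end() && ia->second == ib->second;
}

// First branch: the replay maps the producer ID to the consumer ID.
// Second branch: both IDs are mapped in the graph.
// Otherwise, conservatively require a sync.
bool requiresSyncTwoBranch(
    const ReplayMap& p2c,
    const GroupMap& groups,
    const std::string& p_id,
    const std::string& c_id) {
  auto it = p2c.find(p_id);
  if (it != p2c.end() && it->second == c_id) {
    return false;
  }
  if (mappedInGraph(groups, p_id, c_id)) {
    return false;
  }
  return true;
}
```

For `t2 = t0 + t1`, a graph that groups `i0` with `i2` but keeps `b1` separate would, on its own, report a sync between `b1` and `i2`. A replay map containing `b1 -> i2` lets the first branch clear it.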

Collaborator

If t1 is on smem, and both b1 and i2 are parallelized on TIDx, then we wouldn't need sync? For this case will 1 thread write smem, or all threads write it? Whichever it is, don't we need to wait until all writes are complete before reading?

Collaborator Author

In that case, TIDx would be marked as a redundant parallel type and only one thread should write to smem. You're right that a sync would be required, which should be taken care of by the code around line 580. The way it's handled doesn't seem ideal to me, but I'm trying to make an incremental improvement.

Comment on lines 710 to 712
c_loop = consumer_loop_id;
const auto& indexing_traveral_graph =
id_model.idGraph(IdMappingMode::ALMOSTEXACT);
Collaborator

nit: should this be tensor_indexer.traversalGraph()? I know they are the same today, but in the future, if we use IEL for indexing, should we use IEL or almost exact?

Collaborator Author

Yes. Right now it's also allowed to enable just IdModel without enabling the indexer, so tensor_indexer may not exist. I'll add a static function to TensorIndexer.

@naoyam naoyam changed the base branch from main to relaxed_best_effort_replay September 4, 2024 19:43
@@ -492,7 +495,7 @@ SyncMap::SyncMap(Fusion* fusion) {
producer_redundant_types & (~producer_redundant_use_types);

for (const auto producer_i : c10::irange(producer->nDims())) {
- auto producer_axis = producer->axis(producer_i);
+ auto producer_axis = producer->getLoopDomain().at(producer_i);
Collaborator Author

Not necessary, but this makes it explicit that we are looking at the loop domain.

@@ -653,6 +673,7 @@ SyncMap::SyncMap(Fusion* fusion) {
producer->getLogicalDomain(), {p_id})
.empty()) {
raw_dims.set(producer_ptype);
+ continue;
Collaborator Author

Not related, but I believe this was just missing

@naoyam
Collaborator Author

naoyam commented Sep 4, 2024

!build

@naoyam naoyam marked this pull request as ready for review September 4, 2024 19:52
@naoyam naoyam requested a review from zasdfgbnm September 4, 2024 19:52
@naoyam naoyam changed the title [WIP] Use loop promotion and indexing traversal graph to find mismatched parallelization Use loop promotion and indexing traversal graph to find mismatched parallelization Sep 4, 2024
naoyam added a commit that referenced this pull request Sep 4, 2024
Base automatically changed from relaxed_best_effort_replay to main September 4, 2024 20:26
@naoyam
Collaborator Author

naoyam commented Sep 5, 2024

!build

@naoyam naoyam requested a review from zasdfgbnm September 5, 2024 08:36
@naoyam
Collaborator Author

naoyam commented Sep 5, 2024

@zasdfgbnm Could you please review again? It looks like my testing was not comprehensive, and I missed a simple case where a sync is required. I added a test to reproduce it. Specifically, being mapped by BestEffortReplay doesn't guarantee that no synchronization is needed when a producer domain includes both a broadcast and a non-broadcast.

Case 1:
Producer loop ID: broadcast (which may be produced by merging multiple broadcast domains)
Consumer loop ID: non-broadcast
-> They are not exactly mapped but sync is not necessary as discussed above.

Case 2:
Producer loop ID: non-broadcast
Consumer loop ID: non-broadcast
-> No sync if they are exactly mapped. This case is covered by the promotion check.

Case 3:
Producer loop ID: non-broadcast
Consumer loop ID: non-broadcast
-> Sync required if they are not exactly mapped, even when they are mapped by the best effort replay. (See the new test for a simple repro)

The previous version missed case 3.
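
The three cases can be sketched as a small decision function. This is only a toy reading of the list above: `LoopId`, `exact_group`, and `requiresSyncForPair` are hypothetical, `exact_group` stands in for "exactly mapped after promotion", and requiring a replay mapping in case 1 is an assumption of this sketch. Redundant-write handling for shared memory is not modeled.

```cpp
// Hypothetical model of a loop ID for illustration only.
struct LoopId {
  bool is_broadcast;
  int exact_group;  // IDs with the same value count as exactly mapped
};

bool requiresSyncForPair(
    const LoopId& producer,
    const LoopId& consumer,
    bool mapped_by_replay) {
  // Case 2: exactly mapped, so no sync.
  if (producer.exact_group == consumer.exact_group) {
    return false;
  }
  // Case 1: broadcast producer, non-broadcast consumer. Not exactly mapped,
  // but no sync is needed (assuming here that the replay maps them).
  if (producer.is_broadcast && !consumer.is_broadcast && mapped_by_replay) {
    return false;
  }
  // Case 3 (and anything else): not exactly mapped, so a sync is required
  // even if the best effort replay maps the two IDs.
  return true;
}
```

The point of case 3 is that the replay mapping alone no longer clears a pair of non-broadcast IDs; only an exact mapping does.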

I updated the code and did the validation by comparing the new and old analyses. See #2907 for the actual validation.

@naoyam naoyam merged commit 1158543 into main Sep 5, 2024
5 checks passed
@naoyam naoyam deleted the sync_info_idmodel branch September 5, 2024 16:34
naoyam added a commit that referenced this pull request Sep 5, 2024
Slice and concat patterns without rotation. See #2851.

Stacked on #2897, #2875
Closes #2870