-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use loop promotion and indexing traversal graph to find mismatched parallelization #2875
Conversation
!build --diff |
1 similar comment
!build --diff |
!build --diff-bench |
590d3f4
to
30d3db5
Compare
!build --diff-bench |
!build --pybench |
4d646aa
to
dd1c039
Compare
@zasdfgbnm Could you take a look and let me know what you think? I refactored the sync analysis using the loop promotion results. For now, to make it easier to verify the results, both the existing and new analyses are used and their results are compared. If they don't agree, it's a signal that we should look into, so it'll throw an exception. To do the refactoring, I needed I also enabled the mismatching reshape test. The new analysis was able to find that there's no dependency. Let me know what you think. If it makes sense, I'll prepare a real PR for review. |
!build --pybench |
if (auto it = p2c_map_no_forwarding.find(p_id); | ||
it != p2c_map_no_forwarding.end() && it->second == c_id) { | ||
requires_sync = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this branch? Isn't the second branch itself sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was my initial thought, but it's actually not enough since the AlmostExact graph doesn't map broadcast and non-broadcast domains. For example, just a simple fusion like this would fail:
// t0: [i0]
// t1: [b1]
t2 = t0 + t1
// t2: [i2]
Suppose nothing is inlined, so there's no promotion. Since b1
and i2
are not mapped, this would result in requiring a synchronization, which isn't the case. t2
and t1
use different indices but no sync is needed.
That's why I also mentioned about the broadcast graph. If it were used instead of the AlmostExact graph, that would be sufficient as well.
I think in this case BestEffortReplay
would be good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If t1 is on smem, and both b1 and i2 are parallelized on TIDx, then we wouldn't need sync? For this case will 1 thread write smem, or all threads write it? Whichever it is, don't we need to wait until all writes are complete before reading?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, TIDx would be marked as a redundant parallel type and only one thread should write to smem. You're right a sync would be required, which should be taken care at lines around 580. The way how it's handled doesn't seem ideal to me, but I'm trying to make an incremental improvement.
c_loop = consumer_loop_id; | ||
const auto& indexing_traveral_graph = | ||
id_model.idGraph(IdMappingMode::ALMOSTEXACT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should this be tensor_indexer.traversalGraph()
? I know they are the same today, but in the future, if we use IEL for indexing, should we use IEL or almost exact?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Since right now it's also allowed to enable just IdModel without enabling the indexer, so tensor_indexer
may not exist. I'll add some static function to TensorIndexer
.
ab65811
to
b852d75
Compare
b852d75
to
72ae077
Compare
@@ -492,7 +495,7 @@ SyncMap::SyncMap(Fusion* fusion) { | |||
producer_redundant_types & (~producer_redundant_use_types); | |||
|
|||
for (const auto producer_i : c10::irange(producer->nDims())) { | |||
auto producer_axis = producer->axis(producer_i); | |||
auto producer_axis = producer->getLoopDomain().at(producer_i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessary, but just make it explicit that we are looking at the loop domain
@@ -653,6 +673,7 @@ SyncMap::SyncMap(Fusion* fusion) { | |||
producer->getLogicalDomain(), {p_id}) | |||
.empty()) { | |||
raw_dims.set(producer_ptype); | |||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related, but I believe this was just missing
!build |
!build |
@zasdfgbnm Could you please review again? Looks like my test was not comprehensive and I missed a simple case where a sync is required. I added a test to reproduce it. Specifically, mapping with Case 1: Case 2: Case 3: The previous version missed case 3. I updated the code and did the validation by comparing the new and old analyses. See #2907 for the actual validation. |
See #2850
Stacked on #2901
The old code is still used by default. With
NVFUSER_ENABLE=id_model
, the new analysis is used. It's also used for tensors with non-conventional domains.This is required for #2851. It also enables previously disabled parallelization of the mismatching reshape test from #2684.
I validated the change by comparing the results between the existing and new analyses with all the tests and benchmarks. The only mismatch was with the mismatching reshape test, for which the existing analysis declared a sync is required, whereas the new one correctly recognizes there's no cross-thread dependency.