
Allocation order refactor #2168

Merged 84 commits into NVIDIA:main on May 14, 2024
Conversation

@jjsjann123 (Collaborator) commented May 1, 2024

Refactored the allocation order inference pass:

  • Instead of per-operation propagation rules, we now use the IdModel mapping to map the allocation domain of a reference tensor to the rfactor domain of a target tensor.
  • Updated the inference API to allow specifying sources and destinations for the propagation.
void inferenceAllocationOrder(
    Fusion* fusion,
    const std::vector<TensorView*>& srcs,
    const std::vector<TensorView*>& dsts);
  • The propagation tries to keep the memory format of dsts close to that of srcs, to simplify scheduling and facilitate vectorization. It works roughly as follows:
    • For each entry dst, among all its producers in srcs, we find the one with the most loop iter domains in its allocation domain and use it as the reference ref.
    • We try to map each iter domain in dst's rfactor domain to ref's allocation domain, and push the mapped iter domains as inner dimensions in dst's new allocation domain (following ref's order), while pushing unmapped iter domains as outer dimensions.
    • I had to put in a WAR for the mapping logic for now, since the reduction scheduler struggles with permuted output; see issue #2202 (Reduction scheduler does not handle allocation domain properly and triggers an assert when the reduction output has a specified allocation domain). The WAR simply preserves the existing rfactor position of reduction iter domains in the new allocation domain. This WAR is supposed to be removed once we fix the reduction scheduler; I kept both code paths in the PR for easier future cleanup.
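The mapping steps above can be sketched roughly as follows. This is a simplified illustration, not nvFuser code: iter domains are plain strings, string equality stands in for the IdModel mapping between ref's allocation domain and dst's rfactor domain, and the function name `inferDstAllocation` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch: compute dst's new allocation domain given the reference's
// allocation domain. Mapped ids go innermost (in ref's order); unmapped
// ids go outermost (keeping their original relative order).
std::vector<std::string> inferDstAllocation(
    const std::vector<std::string>& ref_alloc,    // ref's allocation domain
    const std::vector<std::string>& dst_rfactor)  // dst's rfactor domain
{
  // Rank each id by its position in ref's allocation domain.
  std::unordered_map<std::string, size_t> rank;
  for (size_t i = 0; i < ref_alloc.size(); ++i) {
    rank[ref_alloc[i]] = i;
  }
  std::vector<std::string> outer;  // unmapped ids: pushed as outer dims
  std::vector<std::string> inner;  // mapped ids: pushed as inner dims
  for (const auto& id : dst_rfactor) {
    (rank.count(id) ? inner : outer).push_back(id);
  }
  // Order the mapped ids the same way ref orders them.
  std::sort(inner.begin(), inner.end(),
            [&](const std::string& a, const std::string& b) {
              return rank.at(a) < rank.at(b);
            });
  // New allocation domain: unmapped (outer) first, mapped (inner) last.
  outer.insert(outer.end(), inner.begin(), inner.end());
  return outer;
}
```

For example, with ref allocated as {i0, i2, i1} and dst's rfactor domain {i1, i2, i3}, the unmapped i3 stays outermost and the mapped ids follow ref's order, giving {i3, i2, i1}.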

@jjsjann123 (Collaborator, Author)

Local test failures encountered here:

[  FAILED  ] 9 tests, listed below:
[  FAILED  ] NVFuserTest.CombinedSchedulerSharedProducer_CUDA
[  FAILED  ] NVFuserTest.DynamicSqueezeTrivialWelford
[  FAILED  ] AliasTest.NotAllOutputsAlias_Reduction
[  FAILED  ] IndexingOpTest.TakeAlongAxisIntermediateTensorTranspose1_CUDA
[  FAILED  ] TransposeTest.FusionTransposeSelfMapping
[  FAILED  ] TransposeTest.TransposeAggregatedVectorizationWidth
[  FAILED  ] TransposeTest.TransposeSplitAggregatedVectorizationWidth
[  FAILED  ] MoveSplitCatTest.Noncancellable_SomeButNotAllArePermuted
[  FAILED  ] ResizeTest.PadScheduler4

I suspect most are test-related failures; certain tests likely expect output allocation to be laid out in a certain way that this change now violates. I'll try to clean them up a bit.

}
// TODO: open an issue. seems to hit an assert in IdModel(&fusion)
// {
@jjsjann123 (Collaborator, Author)
Note to myself: verify this with ToT main. I'm guessing it's just some IdModel config that I wasn't using properly.

@jjsjann123 (Collaborator, Author) commented May 1, 2024

Local test failures encountered here:

[  FAILED  ] IndexingOpTest.TakeAlongAxisIntermediateTensorTranspose1_CUDA
[  FAILED  ] TransposeTest.FusionTransposeSelfMapping
[  FAILED  ] TransposeTest.TransposeAggregatedVectorizationWidth
[  FAILED  ] TransposeTest.TransposeSplitAggregatedVectorizationWidth
[  FAILED  ] MoveSplitCatTest.Noncancellable_SomeButNotAllArePermuted

@jjsjann123 (Collaborator, Author)

I think we can take a shortcut for matmul/linear. We can add the inputs to matmul/linear ops to our dsts vector and infer their allocation domain from the global fusion IR. Hopefully that's still the right decision to make?! 🤞 Open to discussion on this as well.

Sounds like an interesting approach. Let's try that (in a separate PR).

SGTM

@jacobhinkle (Collaborator)

If we infer a global order of all ValGroups based on all inputs, and make this information available to all schedulers, could that be a solution to #2198?

Doesn't transpose mean we cannot have a total order on ValGroups that places innermost dims properly everywhere?
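The transpose concern can be made concrete with a toy example (hypothetical names, not nvFuser code): once a fusion contains a transpose, two tensors demand different innermost ValGroups, so no single total order over the groups can place the innermost dimension correctly for both.

```cpp
#include <string>
#include <vector>

// Toy illustration: ValGroups are plain strings. In a transpose fusion,
// input t0 is allocated {g0, g1} (g1 innermost) while output t1 is
// allocated {g1, g0} (g0 innermost). A total order must rank exactly one
// group as "most inner", so it cannot match both tensors at once.
bool totalOrderMatchesBoth(
    const std::vector<std::string>& order,  // global order, last = innermost
    const std::vector<std::string>& t0_alloc,
    const std::vector<std::string>& t1_alloc) {
  return order.back() == t0_alloc.back() && order.back() == t1_alloc.back();
}
```

Neither of the two possible total orders over {g0, g1} satisfies both tensors in this example.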

csrc/preseg_passes/allocation_order_inference.cpp (review thread, outdated)
std::vector<IterDomain*> mapped_id_vec;
std::unordered_set<IterDomain*> mapped_id_set;

// logic to preserve reduction iter domain in target to WAR issue #2202
Reviewer (Collaborator):
Is this going to be removed once #2202 is addressed?

@jjsjann123 (Collaborator, Author):
Yes, that's the plan.

@jjsjann123 (Collaborator, Author)

!build --diff

@jjsjann123 (Collaborator, Author)

!build --diff

@jjsjann123 (Collaborator, Author)

!build --diff

@jjsjann123 (Collaborator, Author)

jit_thunder_tests on A100 seems to be reporting a kernel with the wrong arch:
00:00:36 FATAL: kernel fmha_cutlassF_f32_aligned_64x64_rf_sm75 is for sm75-sm80, but was built for sm70

Don't think that one is coming from this PR and I don't see a CI nightly with thunder failure. cc'ing @xwang233

@xwang233 (Collaborator) commented May 10, 2024

> jit_thunder_tests on A100 seems to be reporting kernel with wrong arch. 00:00:36 FATAL: kernel fmha_cutlassF_f32_aligned_64x64_rf_sm75 is for sm75-sm80, but was built for sm70
>
> Don't think that one is coming from this PR and I don't see a CI nightly with thunder failure. cc'ing @xwang233

Somehow that test job got a T4 GPU instead of an A100. Need to investigate that. Please restart new CI jobs if needed.

@jjsjann123 (Collaborator, Author)

!build

@jjsjann123 jjsjann123 requested review from zasdfgbnm and naoyam May 13, 2024 20:22
@jjsjann123 (Collaborator, Author)

I addressed all issues in the comments. I'll merge the PR after CI clears again, since I already got a stamp earlier.

Please do block the merge if you have further concerns. Regarding comments on the propagation rule for reshape, let's move the discussion to #2235.

@jjsjann123 (Collaborator, Author)

Thunder failure isn't related. I'm merging this.

@jjsjann123 jjsjann123 merged commit 8c18701 into NVIDIA:main May 14, 2024
33 of 35 checks passed
@jjsjann123 jjsjann123 deleted the allocation_order_refactor branch May 14, 2024 08:04
@wujingyue wujingyue mentioned this pull request May 14, 2024
wujingyue pushed a commit that referenced this pull request May 15, 2024
jjsjann123 added a commit that referenced this pull request May 15, 2024