-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Schedule epilogue (for Hopper Matmul) by propagation backward from output - smem epilogue not supported. #3580
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Only comments are around code organization.
// Apply mma common transformation | ||
c->split(-2, getM(params_->mma_macro)); | ||
c->split(-1, getN(params_->mma_macro)); | ||
// [..., Mo, No, Mio, Mii, Nio, Nii] | ||
// -> [..., Mo, No, Mio, Nio, Mii, Nii] | ||
c->reorder({{-3, -2}}); | ||
c->merge(-4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this pattern a lot. Can we make this a utility method in this class and give it an informative name?
c->reorder({{-3, -2}}); | ||
c->merge(-4); | ||
|
||
// [...., Mii, Nii] -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is a little hard for me to follow since there are some intermediate splits that are not shown. For example we start with Mii
then get Miii(8)
and Mii/16
, so is that an (Mii/2)/8
(I think so)?
Since this is a complicated function maybe it is best to just skip to the last line (line 439) that shows the final sizes, and maybe give them different names.
auto s = | ||
mma_utils::MmaSwizzler::scheduleMmaOutputAllocation(c->getLoopDomain()); | ||
c->setLoopDomain(s.as<IterDomain*>()); | ||
c->axis(-5)->parallelize(ParallelType::TIDy); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mentioned in #3575 (comment) that this whole chunk should basically be a new method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll wait for @rdspring1 to land his PR then use his fix or add one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll wait to review until you pull the latests changes but the PR looks good to me. 🎉
94f39a1
to
be5f18b
Compare
!build |
2, | ||
{c}, | ||
{ParallelType::BIDx, ParallelType::BIDy, ParallelType::BIDz}); | ||
// Propato to (not including) the splitk output if there is a splitk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Propato to (not including) the splitk output if there is a splitk | |
// Propagate to (not including) the splitk output if there is a splitk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Just be aware of #3590. The current use_smem_epilogue
epilogue will have conflicts. Your changes should be fine.
scheduler_utils::BoundedDirectionalTransformPropagator::backward( | ||
d, | ||
-1, | ||
mma_results_, | ||
propagate_to, | ||
scheduler_utils::BoundedDirectionalTransformPropagator::Options() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note: we'll need to revisit this path for cases where TMA is disabled. We need to make sure it is doing coalesced stores here (can check with ncu). I imagine that we will almost always want to use TMA, but it's possible that we would skip it if it increases smem usage too much. In those cases we'll hit this path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I think this gets us close to full functionality. BTW since smem epilogue is supported please change the title and PR description.
Not as yet - this PR doesn't support smem epilogue. |
Oh OK nevermind then. |
895df0f
to
09cb407
Compare
!test |
This adds support for scheduling epilogue for the hopper matmul scheduler.
We don't support smem epilogue as yet.
We also don't honor the vectorization_factor as yet for the store to output. That'll be covered in a separate PR.