Adding resize(PadOp) vectorization analysis #3321
Conversation
This reverts commit d0addc4.
Co-authored-by: Naoya Maruyama <[email protected]>
It overall looks good. Just would like a few things I commented about to get addressed.
!test --pybench
Initiated testing with python benchmarks just in case.
Thanks, I'll address the issues you brought up as well as run through some real-size problems so we get a taste of the perf impact. 🙇
!test --pybench
!test --pybench
Did a quick look at the perf. The end-to-end time looks very noisy. I'm a bit unsure about my measuring script, so instead I ran nsys on the kernel time. On A100 80GB PCIe, peak bandwidth is 2TB/s.
Something like this vvv.
I think the review comments have been addressed as well. CI was green before my benchmark. Ready for a final review.
!test --pybench
The build failure seems to be flaky. I restarted the CI on that one and it passed internally. Unfortunately it didn't update the GitHub status here. Not a big issue, but cc'ing @xwang233 in case this is something you are not aware of.
csrc/tensor_view.cpp
Outdated
NVF_ERROR(
    unique_uses.count(use),
    "cached_uses is not among the uses of the TensorView");
target_uses.push_back(use);
Isn't this ordering still non-deterministic? I think the parameter itself needs to be deterministically ordered.
Good catch. Sorry, I missed the cached_uses. Let me give it another try.
} else {
  // avoid non-determinism and ensure unique
  std::unordered_set<Expr*> unique_uses;
  auto this_uses = uses();
Is it possible to have duplicates?
I don't think so, since uses_ is private and addUse skips duplicates:
Lines 96 to 102 in a5022da
bool Val::addUse(Expr* expr) {
  if (std::find(uses_.begin(), uses_.end(), expr) == uses_.end()) {
    uses_.push_back(expr);
    return true;
  }
  return false;
}
But I'm a bit hesitant to leave it unchecked, since uses_ is only a std::vector and it's up to the implementation to keep it duplicate-free.
If so, let's make it an error if a duplicate is found.
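A minimal sketch of that suggestion: preserve the deterministic ordering of uses() while treating a duplicate as an error instead of silently skipping it. Names here (checkedUses, the Expr stub) are hypothetical stand-ins, and assert stands in for NVF_ERROR.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

struct Expr {}; // hypothetical stand-in for the real IR expression type

// The result keeps the deterministic order of the input vector; the
// unordered_set is used only to detect duplicates, which are treated
// as an error rather than silently deduplicated.
std::vector<Expr*> checkedUses(const std::vector<Expr*>& uses) {
  std::unordered_set<Expr*> seen;
  std::vector<Expr*> result;
  for (Expr* use : uses) {
    bool inserted = seen.insert(use).second;
    // assert stands in for NVF_ERROR in the real codebase
    assert(inserted && "duplicate use found in uses()");
    (void)inserted;
    result.push_back(use);
  }
  return result;
}
```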
Switched. I'll kick off the CI to see if there are any surprises. 🤞
!test --pybench
LGTM
Adding conditional support of resize in vectorization analysis. This PR allows vectorized loads on PadOp directly, without using a cache load, which improves the performance of the generated kernel.

What's in this PR:
1. Add a propagation rule for resize in vectorization analysis. The propagation rule works as follows:
   i. For a supported resize: a) project the resize op to the frontier and clear (frontier.begin(), resize_position); b) add the projected extent of the new resize op as gcd(id_from, resize_op->leftExpand(), resize_op->rightExpand()).
   ii. For an unsupported resize: clear [frontier.begin(), resize_position]; no behavior change.
2. Update TensorView::cacheAfter to opt in a set of uses to cache while leaving other uses unchanged. This is necessary for cases where an input is used by a PadOp as well as by other operations that rely on a cached load for vectorization.
Follow up to #3261.
Work toward supporting RoPE performance. Design doc: