-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Harden assertBuffersHaveSameSize to check shapes. #3531
Conversation
// and a MatmulOp, and StmtSort::getExprs only traverse via the first | ||
// registered definition (i.e. the Select). This sounds like a bug -- I wonder | ||
// how nvFuser resets the TensorView uses of a kir::Kernel, also non-SSA. | ||
hic->addOutput(tva_j_unsqueezed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The introduction of tva_j_unsqueezed
triggered a weird problem that @samnordmann is probably aware of. I added more explanation and wonder what @naoyam think about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make TensorView live even if no output depends on it for HostIR. Not sure if that would solve the issue, though, as I'm still not entirely clear what the issue is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am aware we artificially need to add the matmul's output as a fusion output to fix the data dependency, that is why tvc_j
was added in the first place. However, I was not aware of the other bug you're mentioning -- that we only traverse the first registered producing Expr.
Would the program break if you only let hic->addOutput(tvc_j);
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the program break if you only let hic->addOutput(tvc_j);?
Yes as I commented at https://github.com/NVIDIA/Fuser/pull/3531/files#diff-30df6421558f87ef0024b01f11752c35d3d68b80a9e6e0ec0fd49de535acb91aR917
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok but I am not sure to fully understand the reason why it breaks. Even if the visitor only traverses through the first definition, i.e., the SelectOp
, then tvc_j
should still be invalidated because the SelectOp
consumes the index j
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, tvcj will be invalidated but tvaj unsqueeze won't be. As a result it holds always hold the first iteration value
!test |
// We could have added `tvc_j` instead as an output, which transitively | ||
// consumes `tva_j_unsqueezed`. However, `tvc_j` has two definitions, a Select | ||
// and a MatmulOp, and StmtSort::getExprs only traverse via the first | ||
// registered definition (i.e. the Select). This sounds like a bug -- I wonder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StmtSort and other stuff in iter_visiter.h assume the SSA property of Fusion
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that mean TVs in a kir::Kernel (also non-SSA) get wrong Val::uses()
, which should be avoided using?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be actually used, but non-SSA definitions in the Kernel IR are pretty limited so far, so we may not encounter any problems. But in general, it isn't a well ironed out use scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I'll have to revisit how we evaluate for-loops. One potential approach is to only invalidate loop-index-dependent scalars and let TensorView ops in the loop body run unconditionally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks!
// and a MatmulOp, and StmtSort::getExprs only traverse via the first | ||
// registered definition (i.e. the Select). This sounds like a bug -- I wonder | ||
// how nvFuser resets the TensorView uses of a kir::Kernel, also non-SSA. | ||
hic->addOutput(tva_j_unsqueezed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am aware we artificially need to add the matmul's output as a fusion output to fix the data dependency, that is why tvc_j
was added in the first place. However, I was not aware of the other bug you're mentioning -- that we only traverse the first registered producing Expr.
Would the program break if you only let hic->addOutput(tvc_j);
?
I wrote this to make the allgather-related issue discovered in #3284 (comment) easier to expose. And it seems a good runtime check to have in extra, because
_allgather_base
treats I/O tensors as flat buffers and ignores the shapes.