Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harden assertBuffersHaveSameSize to check shapes. #3531

Merged
merged 1 commit into from
Dec 6, 2024
Merged

Conversation

wujingyue
Copy link
Collaborator

I wrote this to make the allgather-related issue discovered in #3284 (comment) easier to expose. And it seems a good runtime check to have in extra, because _allgather_base treats I/O tensors as flat buffers and ignores the shapes.

// and a MatmulOp, and StmtSort::getExprs only traverse via the first
// registered definition (i.e. the Select). This sounds like a bug -- I wonder
// how nvFuser resets the TensorView uses of a kir::Kernel, also non-SSA.
hic->addOutput(tva_j_unsqueezed);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The introduction of tva_j_unsqueezed triggered a weird problem that @samnordmann is probably aware of. I added more explanation and wonder what @naoyam think about this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make TensorView live even if no output depends on it for HostIR. Not sure if that would solve the issue, though, as I'm still not entirely clear what the issue is.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am aware we artificially need to add the matmul's output as a fusion output to fix the data dependency, that is why tvc_j was added in the first place. However, I was not aware of the other bug you're mentioning -- that we only traverse the first registered producing Expr.

Would the program break if you only let hic->addOutput(tvc_j); ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the program break if you only let hic->addOutput(tvc_j);?

Yes as I commented at https://github.com/NVIDIA/Fuser/pull/3531/files#diff-30df6421558f87ef0024b01f11752c35d3d68b80a9e6e0ec0fd49de535acb91aR917

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok but I am not sure to fully understand the reason why it breaks. Even if the visitor only traverses through the first definition, i.e., the SelectOp, then tvc_j should still be invalidated because the SelectOp consumes the index j

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, tvcj will be invalidated but tvaj unsqueeze won't be. As a result it holds always hold the first iteration value

@wujingyue
Copy link
Collaborator Author

!test

// We could have added `tvc_j` instead as an output, which transitively
// consumes `tva_j_unsqueezed`. However, `tvc_j` has two definitions, a Select
// and a MatmulOp, and StmtSort::getExprs only traverse via the first
// registered definition (i.e. the Select). This sounds like a bug -- I wonder
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StmtSort and other stuff in iter_visiter.h assume the SSA property of Fusion.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean TVs in a kir::Kernel (also non-SSA) get wrong Val::uses(), which should be avoided using?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be actually used, but non-SSA definitions in the Kernel IR are pretty limited so far, so we may not encounter any problems. But in general, it isn't a well ironed out use scenario.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I'll have to revisit how we evaluate for-loops. One potential approach is to only invalidate loop-index-dependent scalars and let TensorView ops in the loop body run unconditionally.

Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

// and a MatmulOp, and StmtSort::getExprs only traverse via the first
// registered definition (i.e. the Select). This sounds like a bug -- I wonder
// how nvFuser resets the TensorView uses of a kir::Kernel, also non-SSA.
hic->addOutput(tva_j_unsqueezed);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am aware we artificially need to add the matmul's output as a fusion output to fix the data dependency, that is why tvc_j was added in the first place. However, I was not aware of the other bug you're mentioning -- that we only traverse the first registered producing Expr.

Would the program break if you only let hic->addOutput(tvc_j); ?

@wujingyue wujingyue merged commit 76483fe into main Dec 6, 2024
48 checks passed
@wujingyue wujingyue deleted the wjy/shape branch December 6, 2024 19:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants