Harden assertBuffersHaveSameSize to check shapes. #3531

wujingyue · 2024-12-05T07:01:21Z

I wrote this to make the allgather-related issue discovered in #3284 (comment) easier to expose. It also seems like a good extra runtime check to have, because _allgather_base treats I/O tensors as flat buffers and ignores their shapes.
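To illustrate why a shape check is stricter than a size check, here is a minimal sketch. It assumes the earlier check compared only element counts; `Shape`, `numel`, `haveSameNumel`, and `haveSameShape` are hypothetical names for illustration and are not nvFuser's actual helpers.

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a tensor: only the shape matters here.
using Shape = std::vector<int64_t>;

// Number of elements implied by a shape.
int64_t numel(const Shape& shape) {
  return std::accumulate(
      shape.begin(), shape.end(), int64_t{1}, std::multiplies<int64_t>());
}

// Weaker check (assumed previous behavior): equal element counts only.
bool haveSameNumel(const std::vector<Shape>& buffers) {
  for (const Shape& s : buffers) {
    if (numel(s) != numel(buffers.front())) {
      return false;
    }
  }
  return true;
}

// Stricter check: every buffer must have exactly the same shape.
bool haveSameShape(const std::vector<Shape>& buffers) {
  for (const Shape& s : buffers) {
    if (s != buffers.front()) {
      return false;
    }
  }
  return true;
}
```

Buffers of shape [2, 6] and [3, 4] both hold 12 elements, so they pass the element-count check but fail the shape check. A collective that views its inputs and outputs as flat buffers would accept such a pair without complaint, which is the kind of mismatch the stricter check is meant to surface.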

tests/cpp/test_multidevice_overlap.cpp

wujingyue · 2024-12-05T07:04:39Z

tests/cpp/test_multidevice_overlap.cpp

+  // and a MatmulOp, and StmtSort::getExprs only traverse via the first
+  // registered definition (i.e. the Select). This sounds like a bug -- I wonder
+  // how nvFuser resets the TensorView uses of a kir::Kernel, also non-SSA.
+  hic->addOutput(tva_j_unsqueezed);


The introduction of tva_j_unsqueezed triggered a weird problem that @samnordmann is probably aware of. I added more explanation and wonder what @naoyam thinks about this.

For HostIR, we could keep a TensorView live even if no output depends on it. I'm not sure that would solve the issue, though, as I'm still not entirely clear on what the issue is.

I am aware that we artificially need to add the matmul's output as a fusion output to fix the data dependency; that is why tvc_j was added in the first place. However, I was not aware of the other bug you're mentioning: that we only traverse the first registered producing Expr.

Would the program break if you kept only hic->addOutput(tvc_j);?

> Would the program break if you kept only hic->addOutput(tvc_j);?

Yes, as I commented at https://github.com/NVIDIA/Fuser/pull/3531/files#diff-30df6421558f87ef0024b01f11752c35d3d68b80a9e6e0ec0fd49de535acb91aR917

OK, but I'm not sure I fully understand why it breaks. Even if the visitor only traverses through the first definition, i.e., the SelectOp, tvc_j should still be invalidated because the SelectOp consumes the index j.

Yes, tvc_j will be invalidated, but tva_j_unsqueezed won't be. As a result, it always holds the value from the first iteration.
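The traversal problem discussed in this thread can be sketched with a toy IR. This is a rough model under stated assumptions, not nvFuser's implementation: `Val`, `Expr`, `collectReachable`, and `reachableFromTvcJ` are hypothetical, and `tvc` and `tvb` are assumed names for the Select source and the second matmul operand.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy IR, not nvFuser's classes.
struct Val;

struct Expr {
  std::string name;
  std::vector<Val*> inputs;
};

struct Val {
  std::string name;
  // Non-SSA: a value may have more than one defining expression.
  std::vector<Expr*> definitions;
};

// Walks backward from `v` through its definitions and records every value
// name it reaches. With `first_only` set, it follows only the first
// registered definition, mimicking a traversal that assumes SSA.
void collectReachable(Val* v, bool first_only, std::set<std::string>& seen) {
  if (!seen.insert(v->name).second) {
    return;  // already visited
  }
  for (std::size_t i = 0; i < v->definitions.size(); ++i) {
    if (first_only && i > 0) {
      break;
    }
    for (Val* in : v->definitions[i]->inputs) {
      collectReachable(in, first_only, seen);
    }
  }
}

// Builds the toy graph: tvc_j is defined first by a Select on (tvc, j) and
// second by a MatmulOp on (tva_j_unsqueezed, tvb).
std::set<std::string> reachableFromTvcJ(bool first_only) {
  Val j{"j", {}};
  Val tvc{"tvc", {}};
  Val tvb{"tvb", {}};
  Val tva_j_unsqueezed{"tva_j_unsqueezed", {}};
  Val tvc_j{"tvc_j", {}};

  Expr select{"Select", {&tvc, &j}};
  Expr matmul{"MatmulOp", {&tva_j_unsqueezed, &tvb}};
  tvc_j.definitions = {&select, &matmul};

  std::set<std::string> seen;
  collectReachable(&tvc_j, first_only, seen);
  return seen;
}
```

In this model, `reachableFromTvcJ(true)` reaches `j` through the Select but never reaches `tva_j_unsqueezed`, while `reachableFromTvcJ(false)` reaches both. That matches the behavior described above: a dependency analysis that starts from `tvc_j` alone does not see `tva_j_unsqueezed`, so nothing marks it for recomputation when `j` changes.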

wujingyue · 2024-12-05T07:04:50Z

!test

naoyam · 2024-12-05T18:02:37Z

tests/cpp/test_multidevice_overlap.cpp

+  // We could have added `tvc_j` instead as an output, which transitively
+  // consumes `tva_j_unsqueezed`. However, `tvc_j` has two definitions, a Select
+  // and a MatmulOp, and StmtSort::getExprs only traverse via the first
+  // registered definition (i.e. the Select). This sounds like a bug -- I wonder


StmtSort and the other utilities in iter_visitor.h assume the SSA property of Fusion.

Does that mean TVs in a kir::Kernel (also non-SSA) get the wrong Val::uses(), so we should avoid using it?

It may actually be used, but non-SSA definitions in the Kernel IR are pretty limited so far, so we may not encounter any problems. In general, though, it isn't a well-ironed-out usage scenario.

Got it. I'll have to revisit how we evaluate for-loops. One potential approach is to only invalidate loop-index-dependent scalars and let TensorView ops in the loop body run unconditionally.

samnordmann

Looks good, thanks!

samnordmann · 2024-12-06T14:43:31Z

tests/cpp/test_multidevice_overlap.cpp

+  // and a MatmulOp, and StmtSort::getExprs only traverse via the first
+  // registered definition (i.e. the Select). This sounds like a bug -- I wonder
+  // how nvFuser resets the TensorView uses of a kir::Kernel, also non-SSA.
+  hic->addOutput(tva_j_unsqueezed);


I am aware that we artificially need to add the matmul's output as a fusion output to fix the data dependency; that is why tvc_j was added in the first place. However, I was not aware of the other bug you're mentioning: that we only traverse the first registered producing Expr.

Would the program break if you kept only hic->addOutput(tvc_j);?

Harden assertBuffersHaveSameSize to check shapes.

33bf6ba

wujingyue mentioned this pull request Dec 5, 2024

Allgather with DID loop split #3284

Merged

wujingyue requested review from samnordmann and naoyam December 5, 2024 07:48

naoyam reviewed Dec 5, 2024

samnordmann approved these changes Dec 6, 2024

wujingyue merged commit 76483fe into main Dec 6, 2024
48 checks passed

wujingyue deleted the wjy/shape branch December 6, 2024 19:06
