Segmentation failure in matmul + reshape fusion #2127

Closed
Priya2698 opened this issue Apr 22, 2024 · 6 comments
Labels: bug, Matmuls, Segmentation, Top-Down Matmul Dev

Comments

@Priya2698
Collaborator

Repro from @jjsjann123: Lightning-AI/lightning-thunder#207 (comment)

import torch
from nvfuser import FusionDefinition, DataType

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[-1, -1, -1], contiguity=[True, True, True], dtype=DataType.BFloat16, is_cpu=False, stride_order=[2, 1, 0])
    T1 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.BFloat16, is_cpu=False, stride_order=[1, 0])
    T2 = fd.ops.permute(T1, dims=[1, 0])
    T3 = fd.ops.permute(T0, dims=[2, 1, 0])
    S4 = fd.define_scalar(16, dtype=DataType.Int)
    S5 = fd.define_scalar(32, dtype=DataType.Int)
    V6 = fd.define_vector([S4, S5], dtype=DataType.Int)
    T7 = fd.ops.reshape(T3, new_shape=V6)
    T8 = fd.ops.matmul(T2, T7)
    S9 = fd.define_scalar(16, dtype=DataType.Int)
    S10 = fd.define_scalar(16, dtype=DataType.Int)
    S11 = fd.define_scalar(2, dtype=DataType.Int)
    V12 = fd.define_vector([S9, S10, S11], dtype=DataType.Int)
    T13 = fd.ops.reshape(T8, new_shape=V12)
    T14 = fd.ops.permute(T13, dims=[2, 1, 0])
    fd.add_output(T14)

with FusionDefinition() as fd:
    nvfuser_fusion_id0(fd)

inputs = [
    torch.randn((512,), dtype=torch.bfloat16, device='cuda:0').as_strided((2, 16, 16), (256, 16, 1)),
    torch.randn((256,), dtype=torch.bfloat16, device='cuda:0').as_strided((16, 16), (16, 1)),
]   
fd.execute(inputs)

This fails with the error:

 RuntimeError: h.has_value() INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/fusion_segmenter.cpp":3671, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Can not find a scheduler ...
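
As a sanity check, the equivalent eager-mode PyTorch computation runs fine (a minimal sketch reconstructed from the fusion definition above; the variable names are mine):

import torch

t0 = torch.randn((512,), dtype=torch.bfloat16, device='cuda:0').as_strided((2, 16, 16), (256, 16, 1))
t1 = torch.randn((256,), dtype=torch.bfloat16, device='cuda:0').as_strided((16, 16), (16, 1))

t2 = t1.permute(1, 0)             # T2: (16, 16)
t3 = t0.permute(2, 1, 0)          # T3: (16, 16, 2)
t7 = t3.reshape(16, 32)           # T7: (16, 32); copies, since t3 is non-contiguous
t8 = torch.matmul(t2, t7)         # T8: (16, 32)
out = t8.reshape(16, 16, 2).permute(2, 1, 0)  # T14: (2, 16, 16)
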
@Priya2698
Collaborator Author

CC: @jacobhinkle @kevinstephano @naoyam

Priya2698 added the Segmentation and Top-Down Matmul Dev labels Apr 22, 2024
@zasdfgbnm
Collaborator

Probably #1707

@jacobhinkle
Collaborator

> Probably #1707

Yeah it really resembles it. However, in this case I believe it's a little easier. The segmenter is accepting these two segments:

**Segmenter** Considering fusion:
T2_l[ iS57{16}, iS58{16} ]
   = Set.Permute( T1_g[ iS55{16}, iS56{16} ], cache_op=Streaming )
T5_g[ iS59{16}, iS60{16}, bS17{1} ]
   = broadcast( T2_l[ iS57{16}, iS58{16} ] )

***Accepted*** as: no_op

**Segmenter** Considering fusion:
T3_l[ iS62{16}, iS8{i1}, iS7{i0} ]
   = Set.Permute( T0_g[ iS0{i0}, iS1{i1}, iS61{16} ], cache_op=Streaming )
T11_g[ iS63{16}, iS39{( i1 * i0 )}rf ] = view( T3_l[ iS62{16}, iS8{i1}, iS7{i0} ] )

Scheduler _no_op_ ***rejected*** because : output has a concrete dimension
Scheduler _matmul_ ***rejected*** because : Matmul scheduler supports fusions only with a single mma op, or supports a mul-sum pair which can be replaced with a mma op
Scheduler _reduction_ ***rejected*** because : No reduction op to schedule
Scheduler _transpose_ ***rejected*** because : Transpose scheduler does not perform well on small problem sizes.
***Accepted*** as: pointwise

But then it rejects the mma-only segment:

**Segmenter** Considering fusion:
T7_l[ iS64{16}, rS65{16}, iS47{32} ]
   = mma(T5_g[ iS59{16}, iS60{16}, bS17{1} ],
         T6_g[ bS18{1}, iS45{16}, iS46{32} ])
T8_g[ iS66{16}, iS48{32} ]
   = __float2bfloat(T7_l[ iS64{16}, rS65{16}, iS47{32} ]);

Scheduler _no_op_ ***rejected*** because : output has a concrete dimension
Scheduler _matmul_ ***rejected*** because : MmaOp input has unsupported dependency
Scheduler _reduction_ ***rejected*** because : No reduction op to schedule
Scheduler _transpose_ ***rejected*** because : no support for mma ops.
Scheduler _pointwise_ ***rejected*** because : cannot find reference tensor
Scheduler _inner_persistent_ ***rejected*** because : needs a reduction op
Scheduler _outer_persistent_ ***rejected*** because : needs a reduction op
Scheduler _inner_outer_persistent_ ***rejected*** because : needs a reduction op
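
For intuition, that segment is just a broadcast multiply followed by a sum over K. A hypothetical eager-mode illustration of the shapes in the log above (not nvFuser code):

import torch

a = torch.randn(16, 16, 1)   # T5: [M=16, K=16, bS17{1}]
b = torch.randn(1, 16, 32)   # T6: [bS18{1}, K=16, N=32]
out = (a * b).sum(dim=1)     # reduce K (rS65{16}) -> [16, 32], i.e. T8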

The issue is here:

const auto areMmaOpInputDependeciesValid = [](const Val* val) {
  if (val->definition()->isA<BroadcastOp>()) {
    const auto& bcast_inputs = val->definition()->inputs();
    // BroadcastOp has a single input/output, no need to check other things
    return bcast_inputs.front()->isFusionInput() ||
        (dynamic_cast<LoadStoreOp*>(bcast_inputs.front()->definition()) !=
         nullptr);
  }
  return false;
};
We require the MmaOp inputs to be produced by a BroadcastOp. I don't think we really need that check: we should be able to handle either 2D or 3D inputs for 2D matmul problems and just squeeze the broadcast dimensions, if necessary, in the ATen evaluator. cc @protonu
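
Something along these lines in the ATen evaluation path should cover it. A rough Python sketch of the idea only (the real evaluator is C++ inside nvFuser, and matmul_via_aten is a made-up name):

import torch

def matmul_via_aten(a, b):
    # Drop the size-1 dims inserted by BroadcastOp so that broadcast 3D
    # operands can feed a plain 2D torch.matmul.
    if a.dim() == 3 and a.size(-1) == 1:
        a = a.squeeze(-1)   # e.g. T5 [16, 16, 1] -> [16, 16]
    if b.dim() == 3 and b.size(0) == 1:
        b = b.squeeze(0)    # e.g. T6 [1, 16, 32] -> [16, 32]
    return torch.matmul(a, b)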

kevinstephano added the bug and Matmuls labels Apr 23, 2024
@kevinstephano
Collaborator

Fixing this bug depends on Protonu's Allocation Domain Inference work, issue #2058.

@kevinstephano
Collaborator

We do not need to address this bug when using LinearOp and MatmulOp nodes, since their inputs get properly expanded with broadcasts. The problem still exists if we attempt to consume einsum use cases.

@Priya2698
Collaborator Author

Priya2698 commented May 15, 2024

This example now runs correctly. The segmentation issue is resolved when using the ATen scheduler and MatmulOp nodes (PRs #2175 and #2209).
Closing this issue, which was intended for the fallback ATen path.

We may still need to verify correctness with the Matmul scheduler. Issue #1707 tracks a similar problem for the Matmul scheduler.
