Segmentation failure in matmul + reshape fusion #2127

Closed
Priya2698 opened this issue Apr 22, 2024 · 6 comments
Labels: bug, Matmuls, Segmentation, Top-Down Matmul Dev

Comments

@Priya2698
Collaborator

Repro from @jjsjann123: Lightning-AI/lightning-thunder#207 (comment)

import torch
from nvfuser import FusionDefinition, DataType

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[-1, -1, -1], contiguity=[True, True, True], dtype=DataType.BFloat16, is_cpu=False, stride_order=[2, 1, 0])
    T1 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.BFloat16, is_cpu=False, stride_order=[1, 0])
    T2 = fd.ops.permute(T1, dims=[1, 0])
    T3 = fd.ops.permute(T0, dims=[2, 1, 0])
    S4 = fd.define_scalar(16, dtype=DataType.Int)
    S5 = fd.define_scalar(32, dtype=DataType.Int)
    V6 = fd.define_vector([S4, S5], dtype=DataType.Int)
    T7 = fd.ops.reshape(T3, new_shape=V6)
    T8 = fd.ops.matmul(T2, T7)
    S9 = fd.define_scalar(16, dtype=DataType.Int)
    S10 = fd.define_scalar(16, dtype=DataType.Int)
    S11 = fd.define_scalar(2, dtype=DataType.Int)
    V12 = fd.define_vector([S9, S10, S11], dtype=DataType.Int)
    T13 = fd.ops.reshape(T8, new_shape=V12)
    T14 = fd.ops.permute(T13, dims=[2, 1, 0])
    fd.add_output(T14)

with FusionDefinition() as fd:
    nvfuser_fusion_id0(fd)

inputs = [
    torch.randn((512,), dtype=torch.bfloat16, device='cuda:0').as_strided((2, 16, 16), (256, 16, 1)),
    torch.randn((256,), dtype=torch.bfloat16, device='cuda:0').as_strided((16, 16), (16, 1)),
]   
fd.execute(inputs)

This fails with the error:

 RuntimeError: h.has_value() INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/fusion_segmenter.cpp":3671, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Can not find a scheduler ...
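
As a sanity check, the equivalent eager-mode PyTorch computation runs fine (a minimal sketch reconstructed from the fusion definition above; the variable names are mine):

import torch

t0 = torch.randn((512,), dtype=torch.bfloat16, device='cuda:0').as_strided((2, 16, 16), (256, 16, 1))
t1 = torch.randn((256,), dtype=torch.bfloat16, device='cuda:0').as_strided((16, 16), (16, 1))

t2 = t1.permute(1, 0)             # T2: (16, 16)
t3 = t0.permute(2, 1, 0)          # T3: (16, 16, 2)
t7 = t3.reshape(16, 32)           # T7: (16, 32); copies, since t3 is non-contiguous
t8 = torch.matmul(t2, t7)         # T8: (16, 32)
out = t8.reshape(16, 16, 2).permute(2, 1, 0)  # T14: (2, 16, 16)
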
@Priya2698
Collaborator Author

CC: @jacobhinkle @kevinstephano @naoyam

Priya2698 added the Segmentation and Top-Down Matmul Dev labels Apr 22, 2024
@zasdfgbnm
Collaborator

Probably #1707

@jacobhinkle
Collaborator

> Probably #1707

Yeah it really resembles it. However, in this case I believe it's a little easier. The segmenter is accepting these two segments:

**Segmenter** Considering fusion:
T2_l[ iS57{16}, iS58{16} ]
   = Set.Permute( T1_g[ iS55{16}, iS56{16} ], cache_op=Streaming )
T5_g[ iS59{16}, iS60{16}, bS17{1} ]
   = broadcast( T2_l[ iS57{16}, iS58{16} ] )

***Accepted*** as: no_op

**Segmenter** Considering fusion:
T3_l[ iS62{16}, iS8{i1}, iS7{i0} ]
   = Set.Permute( T0_g[ iS0{i0}, iS1{i1}, iS61{16} ], cache_op=Streaming )
T11_g[ iS63{16}, iS39{( i1 * i0 )}rf ] = view( T3_l[ iS62{16}, iS8{i1}, iS7{i0} ] )

Scheduler _no_op_ ***rejected*** because : output has a concrete dimension
Scheduler _matmul_ ***rejected*** because : Matmul scheduler supports fusions only with a single mma op, or supports a mul-sum pair which can be replaced with a mma op
Scheduler _reduction_ ***rejected*** because : No reduction op to schedule
Scheduler _transpose_ ***rejected*** because : Transpose scheduler does not perform well on small problem sizes.
***Accepted*** as: pointwise

But then it rejects the mma-only segment:

**Segmenter** Considering fusion:
T7_l[ iS64{16}, rS65{16}, iS47{32} ]
   = mma(T5_g[ iS59{16}, iS60{16}, bS17{1} ],
         T6_g[ bS18{1}, iS45{16}, iS46{32} ])
T8_g[ iS66{16}, iS48{32} ]
   = __float2bfloat(T7_l[ iS64{16}, rS65{16}, iS47{32} ]);

Scheduler _no_op_ ***rejected*** because : output has a concrete dimension
Scheduler _matmul_ ***rejected*** because : MmaOp input has unsupported dependency
Scheduler _reduction_ ***rejected*** because : No reduction op to schedule
Scheduler _transpose_ ***rejected*** because : no support for mma ops.
Scheduler _pointwise_ ***rejected*** because : cannot find reference tensor
Scheduler _inner_persistent_ ***rejected*** because : needs a reduction op
Scheduler _outer_persistent_ ***rejected*** because : needs a reduction op
Scheduler _inner_outer_persistent_ ***rejected*** because : needs a reduction op
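
For intuition, that segment is just a broadcast multiply followed by a sum over K. A hypothetical eager-mode illustration of the shapes in the log above (not nvFuser code):

import torch

a = torch.randn(16, 16, 1)   # T5: [M=16, K=16, bS17{1}]
b = torch.randn(1, 16, 32)   # T6: [bS18{1}, K=16, N=32]
out = (a * b).sum(dim=1)     # reduce K (rS65{16}) -> [16, 32], i.e. T8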

The issue is here:

const auto areMmaOpInputDependeciesValid = [](const Val* val) {
  if (val->definition()->isA<BroadcastOp>()) {
    const auto& bcast_inputs = val->definition()->inputs();
    // BroadcastOp has a single input/output, no need to check other things
    return bcast_inputs.front()->isFusionInput() ||
        (dynamic_cast<LoadStoreOp*>(bcast_inputs.front()->definition()) !=
         nullptr);
  }
  return false;
};
We require the MmaOp inputs to be produced by a BroadcastOp. I don't think we really need that check: we should be able to handle either 2D or 3D inputs for 2D matmul problems and just squeeze the broadcast dimensions, if necessary, in the ATen evaluator. cc @protonu
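
Something along these lines in the ATen evaluation path should cover it. A rough Python sketch of the idea only (the real evaluator is C++ inside nvFuser, and matmul_via_aten is a made-up name):

import torch

def matmul_via_aten(a, b):
    # Drop the size-1 dims inserted by BroadcastOp so that broadcast 3D
    # operands can feed a plain 2D torch.matmul.
    if a.dim() == 3 and a.size(-1) == 1:
        a = a.squeeze(-1)   # e.g. T5 [16, 16, 1] -> [16, 16]
    if b.dim() == 3 and b.size(0) == 1:
        b = b.squeeze(0)    # e.g. T6 [1, 16, 32] -> [16, 32]
    return torch.matmul(a, b)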

kevinstephano added the bug and Matmuls labels Apr 23, 2024
@kevinstephano
Collaborator

Fixing this bug depends on Protonu's Allocation Domain Inference work, issue #2058.

@kevinstephano
Collaborator

We do not need to address this bug when using LinearOp and MatmulOp nodes, since their inputs get properly expanded with broadcasts. The problem still exists if we attempt to consume einsum use cases.

@Priya2698
Collaborator Author

Priya2698 commented May 15, 2024

This example now runs correctly. The segmentation issue is resolved when using the ATen scheduler and MatmulOp nodes (PRs #2175 and #2209).
Closing this issue, which was intended for the fallback ATen path.

We may still need to verify correctness with the Matmul scheduler. Issue #1707 tracks a similar problem for the Matmul scheduler.
