
fusion should reject / error out with ambiguous in-place update semantics. #2839

Closed
jjsjann123 opened this issue Aug 23, 2024 · 10 comments

@jjsjann123 (Collaborator)

Since we do not handle race conditions, nvfuser's in-place update behavior can be affected by the segmentation result. A small example is below: if we force a segment with segment_set, the output from the fusion changes.

import torch
from nvfuser import FusionDefinition, DataType

def nvfuser_fusion_id1(fd: FusionDefinition) -> None:
    T0 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.Float, is_cpu=False, stride_order=[1, 0])
    T1 = fd.define_tensor(shape=[-1, -1, -1], contiguity=[True, True, True], dtype=DataType.Float, is_cpu=False, stride_order=[2, 1, 0])
    T2 = fd.ops.relu(T0)
    T3 = fd.ops.add(T0, T1)  # a second use of T0 that does not feed into T2
    fd.add_output(T2, T0)    # write T2 back into T0's buffer (in-place update)
    fd.add_output(T3)

with FusionDefinition() as fd:
    nvfuser_fusion_id1(fd)

t0 = torch.randn((20,), dtype=torch.float32, device='cuda:0').as_strided((10, 2), (2, 1))
t1 = torch.randn((40,), dtype=torch.float32, device='cuda:0').as_strided((2, 10, 2), (20, 2, 1))

inputs = [t0.clone(), t1.clone()]

o = fd.execute(inputs)[0]

def nvfuser_fusion_id2(fd: FusionDefinition) -> None:
    T0 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True], dtype=DataType.Float, is_cpu=False, stride_order=[1, 0])
    T1 = fd.define_tensor(shape=[-1, -1, -1], contiguity=[True, True, True], dtype=DataType.Float, is_cpu=False, stride_order=[2, 1, 0])
    T2 = fd.ops.relu(T0)
    T0_seg = fd.ops.segment_set(T0)  # force a segment boundary before the add
    T3 = fd.ops.add(T0_seg, T1)
    fd.add_output(T2, T0)            # same in-place update as above
    fd.add_output(T3)

with FusionDefinition() as fd2:
    nvfuser_fusion_id2(fd2)

inputs = [t0.clone(), t1.clone()]

o_ref = fd2.execute(inputs)[0]

print(o - o_ref)  # nonzero: forcing the segment changed the fusion's output

I think this means that any use of the source buffer (T0) that does not lead into the alias target (T2) is ambiguous under nvfuser's definition, and we should error out in these scenarios.

But this means we would reject quite a few benign cases, which might be too intrusive. We can try to update this in the integration's fusion logic instead.
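
For concreteness, here is a minimal sketch of the check described above, written in plain Python over an invented graph representation (not nvfuser internals): walk the definition as a DAG and flag any use of the aliased input whose result never reaches the alias target.

# Hypothetical sketch: in a fusion where output `target` is written back
# into input `src`, every other use of `src` must lie on a path into
# `target`; otherwise its value depends on whether it runs before or
# after the in-place write-back.
def find_ambiguous_uses(consumers, src, target):
    # consumers maps each value to the values computed directly from it
    def reaches(node, goal, seen):
        if node == goal:
            return True
        seen.add(node)
        return any(reaches(nxt, goal, seen)
                   for nxt in consumers.get(node, []) if nxt not in seen)

    return [use for use in consumers.get(src, [])
            if use != target and not reaches(use, target, set())]

# The repro above: T2 = relu(T0) is written back into T0, while
# T3 = add(T0, T1) never feeds T2 -- so T3 is flagged.
consumers = {"T0": ["T2", "T3"], "T1": ["T3"]}
assert find_ambiguous_uses(consumers, "T0", "T2") == ["T3"]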

@jjsjann123 (Collaborator, Author)

Linking issue Lightning-AI/lightning-thunder#731 (comment)

@jjsjann123 jjsjann123 self-assigned this Aug 23, 2024
@jacobhinkle (Collaborator)

Good point. This is kind of a non-broadcast version of #2664. The solution is to have a kind of sync analysis for aliased inputs.

@jjsjann123 (Collaborator, Author)

> Good point. This is kind of a non-broadcast version of #2664. The solution is to have a kind of sync analysis for aliased inputs.

Yeah, we've touched on that conversation a couple of times.

In this instance, I think it's more a question of how we want to interpret our API. I've always viewed nvfuser's fusion definition as building a computational graph, rather than translating a Python program. Under a definition like that, the program itself is just ill-defined, unless we ask the user to explicitly build a control edge to establish the sequence of operations...

At the same time, if we are always parsing a Python program, maybe the sequence is just implicit and my understanding of the protocol is wrong.
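
To see the two readings side by side, here is a plain-PyTorch analogue (a sketch, not nvfuser code): read as a program, the statement order pins down the result; read as a pure dataflow graph, the write-back and the add have no edge between them, so either schedule is legal, which is exactly what segmentation exposed above.

import torch

t0 = torch.randn(10, 2)
t1 = torch.randn(2, 10, 2)
ref = t0.clone()

# Schedule A ("program" order): the add reads t0 before the write-back.
t3_before = t0 + t1
t0.relu_()            # stand-in for the T2 -> T0 in-place write-back

# Schedule B (also legal under the "graph" reading): write-back first.
t0.copy_(ref)         # reset the buffer
t0.relu_()
t3_after = t0 + t1    # now reads the clobbered buffer

print((t3_before - t3_after).abs().max())  # nonzero whenever relu_ changed t0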

@jjsjann123 (Collaborator, Author)

FYI @kiya00. The check we have in Thunder is on the conservative side, and I think our assumption is that the whole fusion runs in a single kernel. Likely the sanity check is not detecting this pattern.

@jacobhinkle (Collaborator)

> I've always viewed nvfuser's fusion definition as building a computational graph, rather than translating a Python program.

I agree, but I thought that with aliasing we were still SSA, and the guarantee was just that we would place the result back into the aliased input after computation. That way the semantics are unambiguous, since we can analyze it as an SSA program, and if the user wants to specify another behavior, like using the modified T0 for something, they would just use the output tensor directly (T2 in this case). That way we don't need to error when an aliased input has other uses; we just need to ensure that all loads of T0 occur before their stores, i.e. before T2 is stored. This feels like something SyncMap can do, maybe with some updating (see also #2850, which is about rewriting SyncMap to use IdModel).
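
A minimal sketch of that load-before-store constraint, assuming a segmented fusion can be summarized as an ordered list of kernels with known buffer reads and writes (the representation and names here are invented for illustration, not SyncMap's actual interface):

# Hypothetical sketch: given segments in execution order, verify every
# segment reading an aliased input buffer runs before the segment that
# stores the aliased output back into it.
def alias_order_ok(segments, aliased_buffer):
    store_seen = False
    for name, reads, writes in segments:  # (name, read set, write set)
        if aliased_buffer in reads and store_seen:
            return False  # a load of the original value after the store
        if aliased_buffer in writes:
            store_seen = True
    return True

# The failing schedule from the repro: the relu segment writes T0's
# buffer back before the add segment loads it.
bad  = [("relu+writeback", {"T0"}, {"T0"}), ("add", {"T0", "T1"}, set())]
good = [("add", {"T0", "T1"}, set()), ("relu+writeback", {"T0"}, {"T0"})]
assert not alias_order_ok(bad, "T0")
assert alias_order_ok(good, "T0")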

@kiya00

kiya00 commented Aug 28, 2024

> Likely the sanity check is not detecting this pattern.

Hi @jjsjann123, is this going to be fixed in nvfuser, or do I need to update the sanity check to detect this?

@jjsjann123 (Collaborator, Author)

> I agree, but I thought that with aliasing we were still SSA, and the guarantee was just that we would place the result back into the aliased input after computation. That way the semantics are unambiguous, since we can analyze it as an SSA program, and if the user wants to specify another behavior, like using the modified T0 for something, they would just use the output tensor directly.

I agree with you, and that's a reasonable protocol. Maybe we should just go with that.

Our code base as-is doesn't abide by that yet.

In the example above, the user program updates input buffer T0 in place by aliasing output tensor T2 to T0's buffer. A later use of T0 is supposed to read the original value; but if it is segmented into another kernel that executes after the in-place update, it inevitably reads the updated buffer and violates the protocol. So we need to do something like copying that buffer for the later use...
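
As a user-level illustration of what that copy has to accomplish (hypothetical call-site code against the repro above, not a proposed nvfuser change):

# Hand the fusion a private copy of t0, so the in-place write-back cannot
# clobber the buffer that later consumers of the original t0 depend on.
t0_private = t0.clone()   # buffer the fusion may update in place
outs = fd.execute([t0_private, t1.clone()])
# t0 still holds the original values; t0_private now holds relu(t0).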

@jjsjann123 (Collaborator, Author)

> > Likely the sanity check is not detecting this pattern.
>
> Hi @jjsjann123, is this going to be fixed in nvfuser, or do I need to update the sanity check to detect this?

I haven't reached a conclusion on that yet. I don't have any actionable items for you at the moment.

I'm hoping nvfuser can pay down some of its technical debt on in-place support by establishing a protocol and supporting more cases. We might need to update Thunder's integration logic, but that's a longer story. I'll keep you updated.

@jjsjann123 (Collaborator, Author)

jjsjann123 commented Aug 28, 2024

cc'ing @Priya2698

@jjsjann123 (Collaborator, Author)

Closing with Priya's recent segmentation pass, #2999.
