Bind sharded input/output tensors with DID-parallelized allocation domains. #3282
Here's my initial idea and a list of proposed changes:
Update on 11/04: I'll try to change 1 and 4 to add another API to `FusionExecutorCache` that takes tensors matching allocation domains (a possible shape is sketched below). If it works, this will avoid having to change all the C++ tests that set allocation domains. cc @naoyam and @jjsjann123. It feels like a lot of work, but if that's what it takes to fix this problem, I'm willing to give it a shot.
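For illustration only, such an API could look like the following (the name `runFusionWithAllocationInputs` and the exact signature are invented here; nothing below is committed API):

```cpp
// Inside class FusionExecutorCache (hypothetical sketch):
//
// Like runFusionWithInputs, but each input at::Tensor is expected to
// match its TensorView's allocation domain rather than its logical
// domain, and outputs are allocated per their allocation domains too.
std::vector<at::Tensor> runFusionWithAllocationInputs(
    const std::vector<at::Tensor>& inputs_matching_allocation);
```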
Makes sense to me.
SGTM as well. Regarding point 4 here: a heads up that there might be some obsolete permutation logic in our runtime left over from the TorchScript days. I'll try to remove that.
Nitpicking on the implementation details: should this part be on the C++ side instead? I.e., we still have C++ users/tests where we need to ensure correct semantics of the returned results.
I like not having to change C++ users/tests, but I haven't found an easy way to do this all on the C++ side. Where it breaks down is that the allocation domain is the only construct on the C++ side to represent memory format. We don't know whether a particular allocation domain is dictated by stride order, a device split, or anything else in the future. There may be a fragile way of doing so by pattern-matching the logical=>allocation transforms: if logical=>allocation is a permutation, assume it's done by stride order and convert the allocated output tensor back to logical; otherwise, assume a device split and keep the tensor consistent with allocation (see the sketch below). Would that be a good intermediate step?
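A minimal sketch of that heuristic, assuming helpers along the lines of `ir_utils::computePermutation` and an `inversePermutation` utility (both assumed here, not verified against the codebase):

```cpp
// If logical=>allocation is a pure permutation, treat it as stride
// order and permute the allocated output back to logical; otherwise
// assume a device split and return the tensor laid out as allocated.
at::Tensor maybeRestoreLogical(const TensorView* tv, at::Tensor allocated) {
  std::optional<std::vector<int64_t>> perm = ir_utils::computePermutation(
      tv->getLogicalDomain(), tv->getMaybeAllocationDomain()); // assumed
  if (perm.has_value()) {
    return allocated.permute(inversePermutation(*perm)); // assumed helper
  }
  return allocated; // e.g., device split: keep the allocation layout
}
```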
(My braindump) I think the root cause is that `at::Tensor`, as used in the `FusionExecutorCache::runFusionWithInputs` interface, can't always represent both the logical domain and the allocation domain of a TensorView. Currently, with stride order being the only use case, `at::Tensor` barely manages to represent the logical domain in its sizes and the allocation in its strides, which is fine. However, with the new multi-GPU use case mentioned in the OP, the logical=>allocation transforms become more complicated than what strides can represent. And this is not only a problem for multi-GPU: consider binding an `at::Tensor` to a TensorView whose allocation domain splits the logical domain and reorders the resulting axes. The storage will look like `[0,3,1,4,2,5]`, while the logical view has to remain `[0,1,2,3,4,5]` to match the semantics of the fusion definition; no strides over the logical sizes can express that layout (see the toy check below).
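A toy check of that storage order (plain C++, not nvFuser code; the split factor of 3 and the axis swap are my reconstruction of the example):

```cpp
#include <cstdio>

int main() {
  constexpr int kExtent = 6;
  constexpr int kFactor = 3;            // inner split factor (assumed)
  constexpr int kOuter = kExtent / kFactor;
  // Allocation [3, 2]: the split's inner axis is iterated outermost,
  // so memory holds the logical elements in interleaved order.
  for (int c = 0; c < kFactor; ++c) {
    for (int r = 0; r < kOuter; ++r) {
      printf("%d ", r * kFactor + c);   // logical index at this slot
    }
  }
  printf("\n");                         // prints: 0 3 1 4 2 5
  return 0;
}
```

Since the storage slot of logical index `i` is `(i % 3) * 2 + i / 3`, which is not affine in `i`, no single stride can describe this layout.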
There are three general directions that we can consider:
QQ: What does that direction mean?
It means binding `at::Tensor` to the allocation domain as proposed here, and accepting the UX limitation.
to transformFromAllocationToLogical. This method is general enough to work on input tensors as well. For #3282.
Thanks for the write-up. It makes sense to me in general, but I will need more time to process all the details. While I'm reading, I have a couple of questions:
Do you mean we need to apply transforms to retrieve the logical domain back from the ATen shape? With that, I agree.
I should have defined "match" clearly. A logical domain (or any tensor domain) is a list of `IterDomain`s, so an `at::Tensor` matches it when the tensor's sizes equal the (evaluated) extents of those `IterDomain`s, position by position.
Yes. @naoyam reminded me today that this is not even always possible with uneven splits. For that reason, I'm now leaning towards always taking the logical tensor (which can be on the `meta` device) and optionally the allocation tensor (roughly sketched below). Will run some experiments to verify the idea...
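Roughly, that interface could pair the two tensors like this (a sketch only; the struct and field names are invented):

```cpp
#include <optional>
#include <ATen/ATen.h>

// One fusion input: the logical-shaped tensor is always present (it may
// live on the meta device and carry only sizes/strides), while the
// allocation-shaped tensor holding the actual sharded data is optional.
struct FusionArg {
  at::Tensor logical;                    // e.g., allocated on meta
  std::optional<at::Tensor> allocation;  // real data, allocation-shaped
};
```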
It may also be possible to work around this by not having logical sizes. One primary use of logical sizes is predicating each expression. However, for pointwise ops, we can just eliminate predicates. Some threads may use data that's out of logical boundaries, but that should be fine as long as it isn't propagated to the final outputs. Non-pointwise ops do need more careful processing; for example, sum reductions should be fine as long as the buffer is initialized to zero. Overall, however, I'm not sure how this WAR would work in general. It may easily hit limitations.
Ok, I understand now what you mean by "match", thanks for explaining. However, I am not sure I understand:
Can you point to a concrete example?
I would be in favor of not considering (or even forbidding) indivisible splits for now, to fix ideas; that sounds like a reasonable assumption. But anyway, it should always be possible to play the transforms backward, even for indivisible splits:

```cpp
TensorView* tv0 = makeContigTensor(1); // of symbolic extent [i1]
fusion.addInput(tv0);
tv0->split(/*axis=*/0, /*factor=*/8); // now of symbolic extents [ceilDivision(i1, 8), 8]
// [...]
at::Tensor concrete_input = at::rand({4, 8});
```

Then, binding the concrete input allows inferring `i1` (as shown in the sketch below). Of course, this way of doing it wouldn't support wild corner-case scenarios like involving one symbolic extent in two different indivisible splits, but that sounds like a reasonable restriction. Is there a problem with this approach? Wdyt?
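For concreteness, a toy version of that backward inference (plain C++, not nvFuser API), which also shows the ambiguity the next comments poke at:

```cpp
#include <cassert>
#include <cstdint>

int64_t ceilDiv(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}

int main() {
  // concrete_input has shape {4, 8}: outer = 4, inner = 8.
  const int64_t outer = 4;
  const int64_t inner = 8;
  // Playing split(i1, 8) backward: infer i1 = outer * inner = 32.
  const int64_t inferred_i1 = outer * inner;
  assert(ceilDiv(inferred_i1, inner) == outer);
  // The catch with indivisible splits: i1 = 31 produces the same
  // allocation shape, since ceilDiv(31, 8) == 4 as well.
  assert(ceilDiv(31, inner) == outer);
  return 0;
}
```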
Could you elaborate on this?
Did you mean we just assume `i1` is 32, even if it is actually 31?
I guess this is what I mean. However, I am not sure what you mean by "it is actually 31". Since the extent is symbolic, how can we say what it "actually is", and what would that mean?
The shape of an input/output `at::Tensor` matches the logical domain, even when the corresponding TensorView has a non-trivial allocation domain. See `csrc/expr_evaluator.cpp`, line 165 (at 2e4ed54), and `csrc/runtime/allocations.cpp`, lines 769 to 770 (at 85c22a2).
This will be problematic for multi-GPU when we start to parallelize the loop domain. Say the input TV has `logical=[D*S]` and `allocation=[DID{D},S]`, where D is the number of devices and S is the local tensor size. We can't allocate the input to have the entire logical shape.

After discussing this with @jjsjann123, the most natural solution is to let `FusionExecutorCache` expect input tensors to match allocation domains and allocate output tensors according to allocation domains. For example, in the case above, the input tensor will have shape `[1,S]` because `D` in the allocation domain is parallelized (a code sketch follows at the end of this post).

One challenge is to not break the stride_order support needed by Thunder. I think we'll need extra permutations in addition to translating stride order to an allocation domain.
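A sketch of that setup in fusion-definition form, loosely modeled on nvFuser's multi-GPU C++ tests (treat the exact calls, especially the `setAllocationDomain` usage, as an approximation rather than verified code):

```cpp
// logical = [D*S]; shard axis 0 across a mesh of D devices.
TensorView* in = makeContigTensor(1);
fusion->addInput(in);
in->setDeviceMesh(mesh);  // mesh over D devices (assumed in scope)
// Outer-split by the device count: the domain becomes [D, S].
in->split(/*axis=*/0, /*factor=*/num_devices, /*inner_split=*/false);
in->axis(0)->parallelize(ParallelType::DIDx);  // allocation = [DID{D}, S]
in->setAllocationDomain(in->getLoopDomain(), /*new_contiguity=*/true);
// Each rank then binds a local at::Tensor of shape [1, S], not [D*S].
```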