Bind sharded input/output tensors with DID-parallelized allocation domains. #3282
Here's my initial idea and a list of proposed changes:
Update on 11/04: I'll try to change 1 and 4 to add another API to `FusionExecutorCache` that takes tensors matching allocation domains (a possible shape is sketched below). If it works, this will avoid having to change all the C++ tests that set allocation domains. cc @naoyam and @jjsjann123. It feels like a lot of work, but if that's what it takes to fix this problem, I'm willing to give it a shot.
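For illustration only, such an API could look like the following (the name `runFusionWithAllocationInputs` and the exact signature are invented here; nothing below is committed API):

```cpp
// Inside class FusionExecutorCache (hypothetical sketch):
//
// Like runFusionWithInputs, but each input at::Tensor is expected to
// match its TensorView's allocation domain rather than its logical
// domain, and outputs are allocated per their allocation domains too.
std::vector<at::Tensor> runFusionWithAllocationInputs(
    const std::vector<at::Tensor>& inputs_matching_allocation);
```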
Makes sense to me.
SGTM as well. Regarding point 4 here: a heads up that there might be some obsolete permutation logic in our runtime left over from the TorchScript days. I'll try to remove that.
Nitpicking on the implementation details: should this part be on the C++ side instead? I.e., we still have C++ users/tests where we need to ensure correct semantics of the returned results.
I like not having to change C++ users/tests, but I haven't found an easy way to do this all on the C++ side. Where it breaks down is that the allocation domain is the only construct on the C++ side to represent memory format. We don't know whether a particular allocation domain is dictated by stride order, a device split, or anything else in the future. There may be a fragile way of doing so by pattern-matching the logical=>allocation transforms: if logical=>allocation is a permutation, assume it's done by stride order and convert the allocated output tensor back to logical; otherwise, assume a device split and keep the tensor consistent with allocation (see the sketch below). Would that be a good intermediate step?
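A minimal sketch of that heuristic, assuming helpers along the lines of `ir_utils::computePermutation` and an `inversePermutation` utility (both assumed here, not verified against the codebase):

```cpp
// If logical=>allocation is a pure permutation, treat it as stride
// order and permute the allocated output back to logical; otherwise
// assume a device split and return the tensor laid out as allocated.
at::Tensor maybeRestoreLogical(const TensorView* tv, at::Tensor allocated) {
  std::optional<std::vector<int64_t>> perm = ir_utils::computePermutation(
      tv->getLogicalDomain(), tv->getMaybeAllocationDomain()); // assumed
  if (perm.has_value()) {
    return allocated.permute(inversePermutation(*perm)); // assumed helper
  }
  return allocated; // e.g., device split: keep the allocation layout
}
```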
(My braindump) I think the root cause is that `at::Tensor`, as used in the `FusionExecutorCache::runFusionWithInputs` interface, can't always represent both the logical domain and the allocation domain of a TensorView. Currently, with stride order being the only use case, `at::Tensor` barely manages to represent the logical domain in its sizes and the allocation in its strides, which is fine. However, with the new multi-GPU use case mentioned in the OP, the logical=>allocation transforms become more complicated than what strides can represent. And this is not only a problem for multi-GPU: consider binding an `at::Tensor` to a TensorView whose allocation domain splits the logical domain and reorders the resulting axes. The storage will look like `[0,3,1,4,2,5]`, while the logical view has to remain `[0,1,2,3,4,5]` to match the semantics of the fusion definition; no strides over the logical sizes can express that layout (see the toy check below).
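A toy check of that storage order (plain C++, not nvFuser code; the split factor of 3 and the axis swap are my reconstruction of the example):

```cpp
#include <cstdio>

int main() {
  constexpr int kExtent = 6;
  constexpr int kFactor = 3;            // inner split factor (assumed)
  constexpr int kOuter = kExtent / kFactor;
  // Allocation [3, 2]: the split's inner axis is iterated outermost,
  // so memory holds the logical elements in interleaved order.
  for (int c = 0; c < kFactor; ++c) {
    for (int r = 0; r < kOuter; ++r) {
      printf("%d ", r * kFactor + c);   // logical index at this slot
    }
  }
  printf("\n");                         // prints: 0 3 1 4 2 5
  return 0;
}
```

Since the storage slot of logical index `i` is `(i % 3) * 2 + i / 3`, which is not affine in `i`, no single stride can describe this layout.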
There are three general directions that we can consider:
QQ: What does that direction mean?
It means binding `at::Tensor` to the allocation domain as proposed here, and accepting the UX limitation.
to transformFromAllocationToLogical. This method is general enough to work on input tensors as well. For #3282.
Thanks for the write-up. It makes sense to me in general, but I will need more time to process all the details. While I'm reading, I have a couple of questions:
Do you mean we need to apply transforms to retrieve the logical domain back from the ATen shape? With that, I agree.
I should have defined "match" clearly. A logical domain (or any tensor domain) is a list of `IterDomain`s, so an `at::Tensor` matches it when the tensor's sizes equal the (evaluated) extents of those `IterDomain`s, position by position.
Yes. @naoyam reminded me today that this is not even always possible with uneven splits. For that reason, I'm now leaning towards always taking the logical tensor (which can be on the `meta` device) and optionally the allocation tensor (roughly sketched below). Will run some experiments to verify the idea...
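Roughly, that interface could pair the two tensors like this (a sketch only; the struct and field names are invented):

```cpp
#include <optional>
#include <ATen/ATen.h>

// One fusion input: the logical-shaped tensor is always present (it may
// live on the meta device and carry only sizes/strides), while the
// allocation-shaped tensor holding the actual sharded data is optional.
struct FusionArg {
  at::Tensor logical;                    // e.g., allocated on meta
  std::optional<at::Tensor> allocation;  // real data, allocation-shaped
};
```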
It may also be possible to work around this by not having logical sizes. One primary use of logical sizes is predicating each expression. However, for pointwise ops, we can just eliminate predicates. Some threads may use data that's out of logical boundaries, but that should be fine as long as it isn't propagated to the final outputs. Non-pointwise ops do need more careful processing; for example, sum reductions should be fine as long as the buffer is initialized to zero. Overall, however, I'm not sure how this WAR would work in general. It may easily hit limitations.
Ok, I understand now what you mean by "match", thanks for explaining. However, I am not sure I understand:
Can you point to a concrete example?
I would be in favor of not considering (or even forbidding) indivisible splits for now, to fix ideas; that sounds like a reasonable assumption. But anyway, it should always be possible to play the transforms backward, even for indivisible splits:

```cpp
TensorView* tv0 = makeContigTensor(1); // of symbolic extent [i1]
fusion.addInput(tv0);
tv0->split(/*axis=*/0, /*factor=*/8); // now of symbolic extents [ceilDivision(i1, 8), 8]
// [...]
at::Tensor concrete_input = at::rand({4, 8});
```

Then, binding the concrete input allows inferring `i1` (as shown in the sketch below). Of course, this way of doing it wouldn't support wild corner-case scenarios like involving one symbolic extent in two different indivisible splits, but that sounds like a reasonable restriction. Is there a problem with this approach? Wdyt?
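For concreteness, a toy version of that backward inference (plain C++, not nvFuser API), which also shows the ambiguity the next comments poke at:

```cpp
#include <cassert>
#include <cstdint>

int64_t ceilDiv(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}

int main() {
  // concrete_input has shape {4, 8}: outer = 4, inner = 8.
  const int64_t outer = 4;
  const int64_t inner = 8;
  // Playing split(i1, 8) backward: infer i1 = outer * inner = 32.
  const int64_t inferred_i1 = outer * inner;
  assert(ceilDiv(inferred_i1, inner) == outer);
  // The catch with indivisible splits: i1 = 31 produces the same
  // allocation shape, since ceilDiv(31, 8) == 4 as well.
  assert(ceilDiv(31, inner) == outer);
  return 0;
}
```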
Could you elaborate on this?
Did you mean we just assume `i1` is 32, even if it is actually 31?
I guess this is what I mean. However, I am not sure what you mean by "it is actually 31". Since the extent is symbolic, how can we say what it "actually is", and what would that mean?
The shape of an input/output `at::Tensor` matches the logical domain, even when the corresponding TensorView has a non-trivial allocation domain. See `csrc/expr_evaluator.cpp`, line 165 (at 2e4ed54), and `csrc/runtime/allocations.cpp`, lines 769 to 770 (at 85c22a2).
This will be problematic for multi-GPU when we start to parallelize the loop domain. Say the input TV has `logical=[D*S]` and `allocation=[DID{D},S]`, where D is the number of devices and S is the local tensor size. We can't allocate the input to have the entire logical shape.

After discussing this with @jjsjann123, the most natural solution is to let `FusionExecutorCache` expect input tensors to match allocation domains and allocate output tensors according to allocation domains. For example, in the case above, the input tensor will have shape `[1,S]` because `D` in the allocation domain is parallelized (a code sketch follows at the end of this post).

One challenge is to not break the stride_order support needed by Thunder. I think we'll need extra permutations in addition to translating stride order to an allocation domain.
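A sketch of that setup in fusion-definition form, loosely modeled on nvFuser's multi-GPU C++ tests (treat the exact calls, especially the `setAllocationDomain` usage, as an approximation rather than verified code):

```cpp
// logical = [D*S]; shard axis 0 across a mesh of D devices.
TensorView* in = makeContigTensor(1);
fusion->addInput(in);
in->setDeviceMesh(mesh);  // mesh over D devices (assumed in scope)
// Outer-split by the device count: the domain becomes [D, S].
in->split(/*axis=*/0, /*factor=*/num_devices, /*inner_split=*/false);
in->axis(0)->parallelize(ParallelType::DIDx);  // allocation = [DID{D}, S]
in->setAllocationDomain(in->getLoopDomain(), /*new_contiguity=*/true);
// Each rank then binds a local at::Tensor of shape [1, S], not [D*S].
```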