
Memory inefficient order of execution with multiple parallel paths #3290

Open
IvanYashchuk opened this issue Oct 28, 2024 · 4 comments
IvanYashchuk commented Oct 28, 2024

Here's a reproducer (run on a224936) showing that nvFuser uses more memory than the equivalent PyTorch function. It prints:

nvFuser Peak memory diff: 2.75 MiB
  Eager Peak memory diff: 0.75 MiB
import torch
import gc
from nvfuser import FusionDefinition, DataType

N_PARALLEL_PATHS = 10

with FusionDefinition() as fd:
    T0s = [
        fd.define_tensor(
            shape=[256, 256], contiguity=[True, True], dtype=DataType.Float,
            is_cpu=False, stride_order=[1, 0],
        )
        for _ in range(N_PARALLEL_PATHS)
    ]
    a = fd.define_tensor(
        shape=[256, 256], contiguity=[True, True], dtype=DataType.Float,
        is_cpu=False, stride_order=[1, 0],
    )
    for T0 in T0s:
        T1 = fd.ops.relu(T0)
        T2 = fd.ops.matmul(T1, T1)
        T3 = fd.ops.relu(T2)
        a = fd.ops.matmul(T3, a)
    fd.add_output(a)

t0s = [torch.randn(256, 256, device="cuda") for _ in range(N_PARALLEL_PATHS)] # 0.25 MiB * N_PARALLEL_PATHS
a = torch.randn(256, 256, device="cuda") # 0.25 MiB

# Record peak memory usage
fd.execute([*t0s, a])
gc.collect(0)
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
before_in_MiB = torch.cuda.max_memory_allocated() / (1024 * 1024)
fd.execute([*t0s, a])
max_allocated_in_MiB = torch.cuda.max_memory_allocated() / (1024 * 1024) - before_in_MiB
print(f"nvFuser Peak memory diff: {max_allocated_in_MiB:.2f} MiB")

def eager_func(t0s, a):
    for t0 in t0s:
        t1 = torch.nn.functional.relu(t0); del t0
        t2 = torch.matmul(t1, t1); del t1
        t3 = torch.nn.functional.relu(t2); del t2
        a = torch.matmul(t3, a); del t3
    return a

# Record peak memory usage
eager_func(t0s, a)
gc.collect(0)
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
before_in_MiB = torch.cuda.max_memory_allocated() / (1024 * 1024)
eager_func(t0s, a)
max_allocated_in_MiB = torch.cuda.max_memory_allocated() / (1024 * 1024) - before_in_MiB
print(f"  Eager Peak memory diff: {max_allocated_in_MiB:.2f} MiB")

This was discovered while working on a similar problem in Thunder and trying to send the whole example computational graph to nvFuser (Lightning-AI/lightning-thunder#1337).
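
To see that ordering alone accounts for the gap, here is a toy liveness model (my own illustration, not nvFuser code). It assumes each segment produces one 0.25 MiB output that is freed after its last consumer runs, and it ignores fusion inputs since they are live for the whole call either way. Running all ten per-path subgraphs before any of the chained matmuls keeps ten intermediates live at once; finishing one path at a time keeps at most three:

SIZE_MIB = 0.25  # one 256x256 float32 tensor
N = 10           # N_PARALLEL_PATHS

def peak_mib(order, n_uses):
    """order: (output, inputs) per segment; n_uses: tensor -> #consumer segments."""
    remaining = dict(n_uses)
    live, peak = set(), 0.0
    for out, ins in order:
        live.add(out)                         # the segment's output is allocated
        peak = max(peak, len(live) * SIZE_MIB)
        for t in ins:
            remaining[t] -= 1
            if remaining[t] == 0:             # last consumer ran: tensor is freed
                live.discard(t)
    return peak

# Per path i: one segment producing intermediate x_i from the fusion inputs,
# then one segment computing a_{i+1} = matmul(x_i, a_i).
produce = [(f"x{i}", []) for i in range(N)]
chain = [(f"a{i+1}", [f"x{i}", f"a{i}"]) for i in range(N)]
n_uses = {f"x{i}": 1 for i in range(N)}
n_uses.update({f"a{i}": 1 for i in range(N)})

depth_first = [seg for pair in zip(produce, chain) for seg in pair]
breadth_first = produce + chain
print(f"depth-first:   {peak_mib(depth_first, n_uses):.2f} MiB")   # 0.75 MiB
print(f"breadth-first: {peak_mib(breadth_first, n_uses):.2f} MiB") # 2.75 MiB

Under this model the two schedules reproduce the measured 0.75 MiB and 2.75 MiB peaks exactly.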

kevinstephano commented Oct 29, 2024

Is this an issue with the sorting of segments? Maybe we need a heuristic. Do we need to check the order of operators coming from Thunder?

@kevinstephano

Is there any progress on this issue?

jacobhinkle commented Nov 5, 2024

I looked into our options. The general problem of optimizing the order of execution to minimize peak memory for an arbitrary segmented fusion DAG has no known efficient solution; however, some special cases present opportunities. A good reference for this is Kayaaslan et al. 2018. Summary points:

  • If the DAG is a forest of in-trees, then we can optimize the ordering using Liu's algorithm (summarized in Kayaaslan 2018), which is pretty simple. This condition is equivalent to saying that every intermediate tensor is used by only one consumer segment. The example repro in this issue satisfies this condition, so we could very quickly compute an optimal ordering for it. However, it's easy to modify the fusion so that this is not the case, for example by adding a common tensor to all of the inputs before using them in the matmuls.
  • For small DAGs we can brute-force a solution. This is somewhat more efficient than trying all factorial(num_segments) permutations because we can use Kahn's algorithm to DFS the space of topological orderings, and we can skip some inefficient orderings in the same way hill-valley segmentation is used in Liu's algorithm. I am testing an implementation of this locally; a simplified sketch appears after this list.
  • The mentioned paper, Kayaaslan 2018, covers a broad class of DAGs called series-parallel (SP) graphs. These are formed by a sequence of concatenations (series compositions) and parallel merges of subnetworks; an SP graph can be represented as a tree of these operations, and the optimization algorithm works recursively, optimizing the order of subtrees and then merging them as needed. For "parallel chains" subgraphs, this algorithm uses two invocations of Liu's algorithm, so implementing Liu's algorithm is a stepping stone to covering SP graphs.
    • Caveat: I am not 100% confident that their cost model is exactly what we have in practice. It appears to be a "multiple data" model where each producer node produces data specifically for each consumer task, meaning the data are represented as edges. In a "single data" model, producers produce data that is then used by possibly multiple consumers, as in our case. It may be that optimizing the multiple-data case also optimizes the single-data case; I'm not sure. If not, then we should consider this a possibly suboptimal algorithm but still a useful heuristic.
  • When the DAG is not an SP DAG and is not small enough to brute-force, we must solve a nearby problem by first converting the DAG to an SP DAG by removing edges while preserving the transitive ordering relations of the original DAG. This process is called SP-ization, and multiple algorithms for it exist, such as Gonzalez-Escribano, van Gemund, and Cardenoso-Payo 2003. The algorithm is a bit more complicated than the SP-DAG ordering algorithm, but I think this would be a good approach for general DAGs when an exact solution like Liu's algorithm or brute force is not available.
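
As a concrete sketch of the brute-force idea above (hypothetical interface, not our actual implementation), the following uses a simplified cost model in which each segment produces one tensor that is freed after its last consumer runs; a real version would also bound the search by wall-clock time and skip dominated orderings via the hill-valley argument:

def brute_force_order(deps, size):
    """deps: segment -> set of producer segments; size: segment -> output bytes."""
    segs = list(deps)
    users = {s: [t for t in segs if s in deps[t]] for s in segs}
    best = {"peak": float("inf"), "order": None}

    def dfs(done, order, live, peak):
        if peak >= best["peak"]:
            return                                # prune: cannot beat best found
        if len(order) == len(segs):
            best["peak"], best["order"] = peak, order[:]
            return
        # Kahn step: branch on every segment whose producers have all run.
        for s in segs:
            if s in done or not deps[s] <= done:
                continue
            new_live = live + size[s]             # allocate s's output
            freed = sum(size[p] for p in deps[s]  # free tensors at their last use
                        if all(u in done or u == s for u in users[p]))
            done.add(s); order.append(s)
            dfs(done, order, new_live - freed, max(peak, new_live))
            done.remove(s); order.pop()

    dfs(set(), [], 0, 0)
    return best["order"], best["peak"]

# Two paths shaped like the repro: large per-path intermediates x_i feeding
# small chained outputs c_i. Depth-first wins: peak 12 instead of 21.
deps = {"x0": set(), "x1": set(), "c0": {"x0"}, "c1": {"x1", "c0"}}
size = {"x0": 10, "x1": 10, "c0": 1, "c1": 1}
print(brute_force_order(deps, size))  # (['x0', 'c0', 'x1', 'c1'], 12)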

In summary, I think an ideal solution would do the following:

  1. DFS to check whether we can use Liu's algorithm to get an optimal ordering
  2. If not, then run a brute-force search for a fixed time budget, e.g. 10 ms. If we are able to explore all potentially optimal orderings in that time, return the result.
  3. If we could not find an optimal ordering by steps 1. or 2., then compute an approximately optimal ordering by SP-izing the DAG then optimizing the ordering of the resulting SP DAG. We can then compare that peak memory use to the one found by the partial brute force search and return the ordering that takes the minimum amount of peak memory.

Note that steps 1 and 2 are considerably simpler to implement than step 3, which really involves implementing two algorithms. Step 1's applicability check is sketched below.
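
For step 1, the check itself is cheap: the segment DAG is a forest of in-trees exactly when no intermediate tensor has more than one consumer segment. A minimal sketch, assuming a map from each intermediate tensor to the segments that read it:

def liu_applies(consumer_segments):
    """consumer_segments: intermediate tensor -> segments reading it."""
    return all(len(segs) <= 1 for segs in consumer_segments.values())

# The repro qualifies: each x_i feeds one chain segment and each partial
# product a_i feeds only the next one, so Liu's algorithm is usable.
consumers = {f"x{i}": [f"chain{i}"] for i in range(10)}
consumers.update({f"a{i}": [f"chain{i}"] for i in range(1, 10)})
print(liu_applies(consumers))  # True

# One tensor shared by every path (as noted above) breaks the property:
consumers["shared"] = [f"chain{i}" for i in range(10)]
print(liu_applies(consumers))  # False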

@jacobhinkle

I should mention that I also experimented with simple greedy modifications to our current algorithm to try to encourage consumer segments to be placed next to their producer segments, but simple modifications to the repro in this issue still easily wind up with suboptimal orderings in those cases.
