Memory inefficient order of execution with multiple parallel paths #3290
Comments
Issue with sorting of segments? Maybe need a heuristic. Do we need to check the order of operators from Thunder?
Is there any progress on this issue?
I looked into our options. The general problem of ordering execution to minimize memory for an arbitrary segmented fusion DAG doesn't have an efficient solution; however, some special cases present opportunities. A good reference for this is Kayaaslan et al. 2018. Summary points:
In summary, I think an ideal solution would do the following:
Note that steps 1 and 2 are simpler to implement than step 3, which really involves implementing two algorithms.
I should mention that I also experimented with simple greedy modifications to our current algorithm that encourage consumer segments to appear next to their producer segments. However, small modifications to the repro in this issue are still enough to end up with a suboptimal ordering in those cases.
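To make the ordering effect concrete, here is a minimal toy sketch in Python. It is not nvFuser code: the segment names, the output sizes, and the helpers `peak_memory` and `is_topological` are all hypothetical, and the memory model is an assumption (an output is allocated when its segment runs and freed once its last consumer has run). It compares an interleaved order against a depth-first order for two parallel paths, then brute-forces the best order, which is only feasible for tiny graphs.

```python
from itertools import permutations

# Hypothetical toy DAG: two parallel paths (a1->a2->a3, b1->b2->b3) joined by `out`.
# inputs[s] lists the producer segments of s; size[s] is the size of s's output.
inputs = {
    "a1": [], "a2": ["a1"], "a3": ["a2"],
    "b1": [], "b2": ["b1"], "b3": ["b2"],
    "out": ["a3", "b3"],
}
size = {"a1": 4, "a2": 4, "a3": 1, "b1": 4, "b2": 4, "b3": 1, "out": 1}


def peak_memory(order, inputs, size):
    """Peak live size when segments run in `order` (assumed memory model)."""
    # Count how many consumers each segment's output still has.
    consumers_left = {s: 0 for s in size}
    for prods in inputs.values():
        for p in prods:
            consumers_left[p] += 1
    live = peak = 0
    for s in order:
        live += size[s]  # allocate this segment's output
        peak = max(peak, live)
        for p in inputs[s]:
            consumers_left[p] -= 1
            if consumers_left[p] == 0:
                live -= size[p]  # last consumer has run, free the producer output
    return peak


def is_topological(order, inputs):
    """True if every segment appears after all of its producers."""
    seen = set()
    for s in order:
        if any(p not in seen for p in inputs[s]):
            return False
        seen.add(s)
    return True


interleaved = ["a1", "b1", "a2", "b2", "a3", "b3", "out"]
depth_first = ["a1", "a2", "a3", "b1", "b2", "b3", "out"]

interleaved_peak = peak_memory(interleaved, inputs, size)
depth_first_peak = peak_memory(depth_first, inputs, size)

# Exhaustive search over all valid orders; exponential, so toy graphs only.
best_peak = min(
    peak_memory(order, inputs, size)
    for order in permutations(size)
    if is_topological(order, inputs)
)

print(interleaved_peak, depth_first_peak, best_peak)
```

In this toy graph the interleaved order peaks at 12 units while the depth-first order peaks at 9, which is also the exhaustive minimum. Finishing one path before starting the other keeps only the small reduced output alive across the second path, which is the intuition behind placing consumers next to their producers.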
Here's a reproducer showing that nvFuser uses more memory than the equivalent PyTorch function (on a224936). It prints:
This was discovered while working on a similar problem in Thunder and trying to send the whole example computational graph to nvFuser (Lightning-AI/lightning-thunder#1337).