A graph-based pipeline splitting #1080
Nice work! Could you please provide some readings/explanations regarding this implementation?
This is original work, but the ideas are fairly well known. Perhaps the closest implementations are the works here and here. A common objective of most pipeline-parallelism strategies is to split the model into stages of (roughly) the same size while minimizing the required communication. The difference lies in how the "size of a stage" and the "required communication" are defined. My implementation relies on the existing knobs provided by the pippy tracer (e.g., via ModelSplit._analyze_node_size). Roughly speaking, the size of a stage is the sum of the memory required to store the parameters/buffers of the corresponding operators. The communication between stages is estimated via the sizes of the output activations. Given these estimates, it is easy to formulate the optimization problem as a MILP. All the constraints and the objective are fairly straightforward; the linked papers describe very similar formulations. Algorithm-wise, the formulated MILP is solved via SciPy, which is typically available or can be installed. It is not a perfect solver, but it works reasonably well on my benchmarks.
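For readers who want something concrete, here is a minimal sketch of a formulation along these lines. It is not the code from this PR: the function name `split_graph`, the `slack` parameter, and the exact constraints are assumptions made for illustration. Each node gets binary variables x[v, s] ("node v is in stage s"), each edge gets a binary variable z[e] ("edge e crosses a stage boundary"), the per-stage memory is capped at a slack factor above the average, and the objective is the total activation size on cut edges. Only `scipy.optimize.milp`, `Bounds`, and `LinearConstraint` are real SciPy APIs (SciPy 1.9 or newer).

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp


def split_graph(weights, edges, num_stages, slack=0.25):
    """Illustrative MILP: assign each node to a stage, return stage indices.

    weights: per-node memory estimate (parameters + buffers).
    edges:   list of (u, v, comm) where u feeds v and comm is the
             estimated size of the activation sent along the edge.
    slack:   hypothetical knob; each stage may hold at most
             (1 + slack) * total_weight / num_stages.
    """
    n, m, S = len(weights), len(edges), num_stages
    num_vars = n * S + m  # x[v, s] variables first, then z[e]

    def x(v, s):
        return v * S + s

    def z(e):
        return n * S + e

    # Objective: total communication over edges that cross stages.
    cost = np.zeros(num_vars)
    for e, (_, _, comm) in enumerate(edges):
        cost[z(e)] = comm

    rows, lower, upper = [], [], []

    # Each node is placed in exactly one stage.
    for v in range(n):
        row = np.zeros(num_vars)
        row[[x(v, s) for s in range(S)]] = 1.0
        rows.append(row); lower.append(1.0); upper.append(1.0)

    # Balance: the memory of each stage stays below the cap.
    cap = (1.0 + slack) * float(sum(weights)) / S
    for s in range(S):
        row = np.zeros(num_vars)
        for v in range(n):
            row[x(v, s)] = weights[v]
        rows.append(row); lower.append(0.0); upper.append(cap)

    # For each edge (u, v): 0 <= stage(v) - stage(u) <= (S - 1) * z[e].
    for e, (u, v, _) in enumerate(edges):
        diff = np.zeros(num_vars)
        for s in range(S):
            diff[x(v, s)] += s
            diff[x(u, s)] -= s
        rows.append(diff.copy()); lower.append(0.0); upper.append(np.inf)
        cut = diff.copy()
        cut[z(e)] = -(S - 1)
        rows.append(cut); lower.append(-np.inf); upper.append(0.0)

    result = milp(
        c=cost,
        integrality=np.ones(num_vars),  # all variables are integer (binary)
        bounds=Bounds(np.zeros(num_vars), np.ones(num_vars)),
        constraints=LinearConstraint(
            np.array(rows), np.array(lower), np.array(upper)
        ),
    )
    if not result.success:
        raise ValueError("no feasible split: " + result.message)

    assignment = np.round(result.x[: n * S]).reshape(n, S)
    return [int(row.argmax()) for row in assignment]
```

On a toy chain of four unit-weight nodes with edge costs 5, 1, 5 and two stages, `split_graph([1, 1, 1, 1], [(0, 1, 5), (1, 2, 1), (2, 3, 5)], 2)` places the cut on the cheap middle edge and returns `[0, 0, 1, 1]`. A real graph with thousands of nodes would want sparse constraint matrices rather than the dense rows used here.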
Thanks for the explanation! I might try implementing this in a similar way for my own pipeline parallelism. This code really helps a lot ;)
The latest commit makes the computation much faster (e.g., ~2 sec on gpt2) by "pre-solving" the instance and assigning some nodes to specific stages. The code is now tested and works on the latest pippy rev.
LGTM! Thanks much for pulling in the algorithm!
nit: can you fix the lint error? Thanks!
New measurements on the latest revision:
The last column shows the runtime of the splitting algorithm, that is, the overhead of the approach over manual/autosplit policies. The delay is (roughly) proportional to the size of the computation graph; in this benchmark, the graphs contain between 1K and 4K nodes.
An automatic graph-based pipeline splitting algorithm. The goal of the method is to split the computation graph into stages to minimize the communication between the stages while trying to balance the computation. The optimization is done via solving a mixed-integer linear program (MILP) using scipy.

Measuring mean batch time in sec over 50 batches (after a warmup) for various models using "manual split", "--autosplit", and the new "--graphsplit":
That is, the results of graph-split are almost identical to manual splitting, indicating that no manual model annotation is needed.