
A graph-based pipeline splitting #1080

Merged: 3 commits into pytorch:main, May 31, 2024

Conversation

@spupyrev (Contributor) commented on Apr 22, 2024:

An automatic graph-based pipeline splitting algorithm. The goal of the method is to split the computation graph into stages so as to minimize the communication between the stages while keeping the computation balanced across them. The optimization is done by solving a mixed-integer linear program (MILP) using SciPy.

Mean batch time in seconds over 50 batches (after a warmup) for various models, using the manual split, --autosplit, and the new --graphsplit:

| model | ngpus | manual | (old) autosplit | (new) graphsplit | algo time |
|---|---|---|---|---|---|
| pippy_bert | 2 | 0.0876 | 0.1030 | 0.0851 | 0.2 sec |
| pippy_bert | 4 | 0.0513 | 0.0590 | 0.0495 | 1.0 sec |
| pippy_gpt2 | 2 | 0.1031 | 0.1316 | 0.0997 | 0.5 sec |
| pippy_gpt2 | 4 | 0.0572 | 0.0820 | 0.0547 | 2.7 sec |
| pippy_fnet | 2 | 0.0661 | 0.0938 | 0.0664 | 0.2 sec |
| pippy_fnet | 4 | 0.0359 | crash | 0.0351 | 0.8 sec |
| pippy_blenderbot | 2 | 0.4853 | 0.5014 | 0.4863 | 0.6 sec |
| pippy_blenderbot | 4 | 0.2470 | 0.2630 | 0.2479 | 3.1 sec |
| pippy_electra | 2 | 0.1085 | 0.1385 | 0.1028 | 0.2 sec |
| pippy_electra | 4 | 0.0607 | crash | 0.0544 | 1.1 sec |

That is, the results of graph-split are almost identical to manual splitting, indicating that no manual model annotation is needed.

@spupyrev (Contributor, Author) added:

Tests and measurements were done on an older rev (467dc1b), since the latest one has some tracing issues (e.g., #1087).

@spupyrev marked this pull request as ready for review on April 26, 2024.
(Review threads on examples/huggingface/pippy_gpt2.py and pippy/graphsplit.py were resolved.)
@botbw commented on Apr 30, 2024:

Nice work! Could you please provide some readings/explanations regarding this implementation?

@spupyrev (Contributor, Author) replied:

> Nice work! Could you please provide some readings/explanations regarding this implementation?

This is original work, but the ideas are fairly well known. Perhaps the closest implementations are the works here and here.

A common objective for most pipeline parallelism strategies is to split the model into stages of (roughly) the same size so as to minimize the required communication. The differences lie in how the "size of a stage" and the "required communication" are defined. My implementation relies on the existing knobs provided by the pippy tracer (e.g., via ModelSplit._analyze_node_size). Roughly speaking, the size of a stage is the sum of the memory required for storing the parameters/buffers of the corresponding operators, and the communication between stages is estimated via the sizes of the output activations. Given these estimates, it is easy to formulate the optimization problem as a MILP. All the constraints and the objective are fairly straightforward; the linked papers describe very similar formulations.
(I do realize that my model is likely not perfect and ignores some details; I opted, however, for simplicity of implementation, possibly at the cost of some performance. We can revisit the details in the future.)
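
Schematically, such a program can be written as follows (an illustrative formulation in my notation here, not necessarily the exact one in pippy/graphsplit.py): binaries $z_{v,s}$ place node $v$ on stage $s$, $w_v$ is the parameter/buffer memory of node $v$, $c_{uv}$ is the activation size on edge $(u,v)$, and $p_v = \sum_s s \cdot z_{v,s}$ is the stage index of $v$:

$$
\begin{aligned}
\min \quad & \sum_{(u,v) \in E} c_{uv} \, (p_v - p_u) + \lambda L \\
\text{s.t.} \quad & \textstyle\sum_s z_{v,s} = 1 && \forall v && \text{(each node on exactly one stage)} \\
& p_u \le p_v && \forall (u,v) \in E && \text{(stages respect topological order)} \\
& \textstyle\sum_v w_v \, z_{v,s} \le L && \forall s && (L \text{ bounds the largest stage})
\end{aligned}
$$

Because $p_u \le p_v$ on every edge, the term $c_{uv}(p_v - p_u)$ is a linear proxy for the communication incurred when an edge crosses stage boundaries, and $\lambda$ trades communication against balance.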

Algorithm-wise, the formulated MILP is solved via SciPy, which is typically already available or can easily be installed. It is not a perfect solver, but it works reasonably well on my benchmarks.
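
To make the shape of the formulation concrete, here is a minimal, self-contained sketch of such a MILP with scipy.optimize.milp on a toy chain graph; the node/edge weights and the balance coefficient lam are made up for illustration, and the real code in pippy/graphsplit.py is not this simple:

```python
# Toy version of the stage-assignment MILP, solved with SciPy's MILP interface.
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

node_mem = np.array([4.0, 2.0, 3.0, 2.0, 4.0, 1.0])  # param/buffer bytes per node
edges = [(0, 1, 8.0), (1, 2, 2.0), (2, 3, 8.0), (3, 4, 2.0), (4, 5, 8.0)]
V, S = len(node_mem), 2  # number of graph nodes and pipeline stages
lam = 0.1                # weight of the load-balance term in the objective

def zi(v, s):            # flat index of the binary z[v, s] ("node v on stage s")
    return v * S + s

n = V * S + 1            # all z variables plus one continuous variable L
c = np.zeros(n)
c[-1] = lam              # objective: lam * L ...
for u, v, w in edges:    # ... plus communication c_uv * (p_v - p_u),
    for s in range(S):   # where p_v = sum_s s * z[v, s]
        c[zi(v, s)] += w * s
        c[zi(u, s)] -= w * s

A, lb, ub = [], [], []
for v in range(V):       # each node is assigned to exactly one stage
    row = np.zeros(n)
    row[[zi(v, s) for s in range(S)]] = 1.0
    A.append(row); lb.append(1.0); ub.append(1.0)
for u, v, _ in edges:    # topological order: p_u <= p_v on every edge
    row = np.zeros(n)
    for s in range(S):
        row[zi(u, s)] += s
        row[zi(v, s)] -= s
    A.append(row); lb.append(-np.inf); ub.append(0.0)
for s in range(S):       # memory of every stage is at most L
    row = np.zeros(n)
    for v in range(V):
        row[zi(v, s)] = node_mem[v]
    row[-1] = -1.0
    A.append(row); lb.append(-np.inf); ub.append(0.0)
for s in range(S):       # every stage must receive at least one node
    row = np.zeros(n)
    for v in range(V):
        row[zi(v, s)] = 1.0
    A.append(row); lb.append(1.0); ub.append(np.inf)

integrality = np.ones(n)  # z variables are integers (binary via the bounds) ...
integrality[-1] = 0       # ... while L is continuous
bounds = Bounds(np.zeros(n), np.append(np.ones(V * S), np.inf))
res = milp(c, constraints=LinearConstraint(np.vstack(A), lb, ub),
           integrality=integrality, bounds=bounds)
assert res.success
stage = [int(round(sum(s * res.x[zi(v, s)] for s in range(S)))) for v in range(V)]
print("node -> stage:", stage)  # cuts the cheap (1, 2) edge: [0, 0, 1, 1, 1, 1]
```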

@botbw commented on May 1, 2024, quoting the explanation above:

Thanks for the explanation! I might try implementing this in a similar way for my own pipeline parallelism, this code really helps a lot ;)

@spupyrev (Contributor, Author) added:

The latest commit makes the computation much faster (e.g., ~2 sec on gpt2) by "pre-solving" the instance and assigning some nodes to specific stages. The code is now tested and works on the latest pippy rev.
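
In terms of the toy SciPy sketch above, such a pre-assignment can be expressed by tightening the variable bounds before calling milp; the helper below is hypothetical, and the actual pre-solving heuristic in pippy/graphsplit.py chooses which nodes to pin differently:

```python
# Hypothetical helper, reusing names from the sketch above: pin node v to
# stage s by forcing z[v, s] = 1 and z[v, t] = 0 for all t != s.
def pin_node(bounds, v, s):
    for t in range(S):
        val = 1.0 if t == s else 0.0
        bounds.lb[zi(v, t)] = val
        bounds.ub[zi(v, t)] = val

pin_node(bounds, 0, 0)          # e.g., the first node must be on stage 0
pin_node(bounds, V - 1, S - 1)  # and the last node on the last stage
```

Fixing variables this way shrinks the branch-and-bound search space, which is where the speedup comes from.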

@kwen2501 (Contributor) left a review:

LGTM! Thanks much for pulling in the algorithm!
nit: can you fix the lint error? Thanks!

@spupyrev (Contributor, Author) commented on May 31, 2024:

New measurements on the latest revision:

| model | ngpus | manual | (old) autosplit | (new) graphsplit | algo time |
|---|---|---|---|---|---|
| pippy_bert | 2 | 0.0876 | 0.1030 | 0.0851 | 0.2 sec |
| pippy_bert | 4 | 0.0513 | 0.0590 | 0.0495 | 1.0 sec |
| pippy_gpt2 | 2 | 0.1031 | 0.1316 | 0.0997 | 0.5 sec |
| pippy_gpt2 | 4 | 0.0572 | 0.0820 | 0.0547 | 2.7 sec |
| pippy_fnet | 2 | 0.0661 | 0.0938 | 0.0664 | 0.2 sec |
| pippy_fnet | 4 | 0.0359 | crash | 0.0351 | 0.8 sec |
| pippy_blenderbot | 2 | 0.4853 | 0.5014 | 0.4863 | 0.6 sec |
| pippy_blenderbot | 4 | 0.2470 | 0.2630 | 0.2479 | 3.1 sec |
| pippy_electra | 2 | 0.1085 | 0.1385 | 0.1028 | 0.2 sec |
| pippy_electra | 4 | 0.0607 | crash | 0.0544 | 1.1 sec |

The last column shows the runtime of the splitting algorithm, that is, the overhead of the approach over manual/autosplit policies. The delay is (roughly) proportional to the size of the computation graph; in this benchmark, the graphs contain between 1K and 4K nodes.

@spupyrev merged commit 5e1d719 into pytorch:main on May 31, 2024; 6 checks passed.