
A graph-based pipeline splitting #1080

Merged: 3 commits into pytorch:main, May 31, 2024

Conversation

@spupyrev (Contributor) commented on Apr 22, 2024:

An automatic graph-based pipeline splitting algorithm. The goal of the method is to split the computation graph into stages so as to minimize the communication between the stages while keeping the computation balanced across them. The optimization is done by solving a mixed-integer linear program (MILP) using SciPy.

Mean batch time in seconds over 50 batches (after a warmup) for various models, using the manual split, --autosplit, and the new --graphsplit:

| model | ngpus | manual | (old) autosplit | (new) graphsplit | algo time |
|---|---|---|---|---|---|
| pippy_bert | 2 | 0.0876 | 0.1030 | 0.0851 | 0.2 sec |
| pippy_bert | 4 | 0.0513 | 0.0590 | 0.0495 | 1.0 sec |
| pippy_gpt2 | 2 | 0.1031 | 0.1316 | 0.0997 | 0.5 sec |
| pippy_gpt2 | 4 | 0.0572 | 0.0820 | 0.0547 | 2.7 sec |
| pippy_fnet | 2 | 0.0661 | 0.0938 | 0.0664 | 0.2 sec |
| pippy_fnet | 4 | 0.0359 | crash | 0.0351 | 0.8 sec |
| pippy_blenderbot | 2 | 0.4853 | 0.5014 | 0.4863 | 0.6 sec |
| pippy_blenderbot | 4 | 0.2470 | 0.2630 | 0.2479 | 3.1 sec |
| pippy_electra | 2 | 0.1085 | 0.1385 | 0.1028 | 0.2 sec |
| pippy_electra | 4 | 0.0607 | crash | 0.0544 | 1.1 sec |

That is, the results of graph-split are almost identical to manual splitting, indicating that no manual model annotation is needed.

@spupyrev (Contributor, Author) added:

Tests and measurements were done on an older rev (467dc1b), since the latest one has some tracing issues (e.g., #1087).

@spupyrev marked this pull request as ready for review on April 26, 2024.
(Review threads on examples/huggingface/pippy_gpt2.py and pippy/graphsplit.py were resolved.)
@botbw commented on Apr 30, 2024:

Nice work! Could you please provide some readings/explanations regarding this implementation?

@spupyrev (Contributor, Author) replied:

> Nice work! Could you please provide some readings/explanations regarding this implementation?

This is original work, but the ideas are fairly well known. Perhaps the closest implementations are the works here and here.

A common objective for most pipeline parallelism strategies is to split the model into stages of (roughly) the same size so as to minimize the required communication. The differences lie in how the "size of a stage" and the "required communication" are defined. My implementation relies on the existing knobs provided by the pippy tracer (e.g., via ModelSplit._analyze_node_size). Roughly speaking, the size of a stage is the sum of the memory required for storing the parameters/buffers of the corresponding operators, and the communication between stages is estimated via the sizes of the output activations. Given these estimates, it is easy to formulate the optimization problem as a MILP. All the constraints and the objective are fairly straightforward; the linked papers describe very similar formulations.
(I do realize that my model is likely not perfect and ignores some details; I opted, however, for simplicity of implementation, possibly at the cost of some performance. We can revisit the details in the future.)
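
Schematically, such a program can be written as follows (an illustrative formulation in my notation here, not necessarily the exact one in pippy/graphsplit.py): binaries $z_{v,s}$ place node $v$ on stage $s$, $w_v$ is the parameter/buffer memory of node $v$, $c_{uv}$ is the activation size on edge $(u,v)$, and $p_v = \sum_s s \cdot z_{v,s}$ is the stage index of $v$:

$$
\begin{aligned}
\min \quad & \sum_{(u,v) \in E} c_{uv} \, (p_v - p_u) + \lambda L \\
\text{s.t.} \quad & \textstyle\sum_s z_{v,s} = 1 && \forall v && \text{(each node on exactly one stage)} \\
& p_u \le p_v && \forall (u,v) \in E && \text{(stages respect topological order)} \\
& \textstyle\sum_v w_v \, z_{v,s} \le L && \forall s && (L \text{ bounds the largest stage})
\end{aligned}
$$

Because $p_u \le p_v$ on every edge, the term $c_{uv}(p_v - p_u)$ is a linear proxy for the communication incurred when an edge crosses stage boundaries, and $\lambda$ trades communication against balance.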

Algorithm-wise, the formulated MILP is solved via SciPy, which is typically already available or can easily be installed. It is not a perfect solver, but it works reasonably well on my benchmarks.
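
To make the shape of the formulation concrete, here is a minimal, self-contained sketch of such a MILP with scipy.optimize.milp on a toy chain graph; the node/edge weights and the balance coefficient lam are made up for illustration, and the real code in pippy/graphsplit.py is not this simple:

```python
# Toy version of the stage-assignment MILP, solved with SciPy's MILP interface.
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

node_mem = np.array([4.0, 2.0, 3.0, 2.0, 4.0, 1.0])  # param/buffer bytes per node
edges = [(0, 1, 8.0), (1, 2, 2.0), (2, 3, 8.0), (3, 4, 2.0), (4, 5, 8.0)]
V, S = len(node_mem), 2  # number of graph nodes and pipeline stages
lam = 0.1                # weight of the load-balance term in the objective

def zi(v, s):            # flat index of the binary z[v, s] ("node v on stage s")
    return v * S + s

n = V * S + 1            # all z variables plus one continuous variable L
c = np.zeros(n)
c[-1] = lam              # objective: lam * L ...
for u, v, w in edges:    # ... plus communication c_uv * (p_v - p_u),
    for s in range(S):   # where p_v = sum_s s * z[v, s]
        c[zi(v, s)] += w * s
        c[zi(u, s)] -= w * s

A, lb, ub = [], [], []
for v in range(V):       # each node is assigned to exactly one stage
    row = np.zeros(n)
    row[[zi(v, s) for s in range(S)]] = 1.0
    A.append(row); lb.append(1.0); ub.append(1.0)
for u, v, _ in edges:    # topological order: p_u <= p_v on every edge
    row = np.zeros(n)
    for s in range(S):
        row[zi(u, s)] += s
        row[zi(v, s)] -= s
    A.append(row); lb.append(-np.inf); ub.append(0.0)
for s in range(S):       # memory of every stage is at most L
    row = np.zeros(n)
    for v in range(V):
        row[zi(v, s)] = node_mem[v]
    row[-1] = -1.0
    A.append(row); lb.append(-np.inf); ub.append(0.0)
for s in range(S):       # every stage must receive at least one node
    row = np.zeros(n)
    for v in range(V):
        row[zi(v, s)] = 1.0
    A.append(row); lb.append(1.0); ub.append(np.inf)

integrality = np.ones(n)  # z variables are integers (binary via the bounds) ...
integrality[-1] = 0       # ... while L is continuous
bounds = Bounds(np.zeros(n), np.append(np.ones(V * S), np.inf))
res = milp(c, constraints=LinearConstraint(np.vstack(A), lb, ub),
           integrality=integrality, bounds=bounds)
assert res.success
stage = [int(round(sum(s * res.x[zi(v, s)] for s in range(S)))) for v in range(V)]
print("node -> stage:", stage)  # cuts the cheap (1, 2) edge: [0, 0, 1, 1, 1, 1]
```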

@botbw commented on May 1, 2024, quoting the explanation above:

Thanks for the explanation! I might try implementing this in a similar way for my own pipeline parallelism, this code really helps a lot ;)

@spupyrev (Contributor, Author) added:

The latest commit makes the computation much faster (e.g., ~2 sec on gpt2) by "pre-solving" the instance and assigning some nodes to specific stages. The code is now tested and works on the latest pippy rev.
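
In terms of the toy SciPy sketch above, such a pre-assignment can be expressed by tightening the variable bounds before calling milp; the helper below is hypothetical, and the actual pre-solving heuristic in pippy/graphsplit.py chooses which nodes to pin differently:

```python
# Hypothetical helper, reusing names from the sketch above: pin node v to
# stage s by forcing z[v, s] = 1 and z[v, t] = 0 for all t != s.
def pin_node(bounds, v, s):
    for t in range(S):
        val = 1.0 if t == s else 0.0
        bounds.lb[zi(v, t)] = val
        bounds.ub[zi(v, t)] = val

pin_node(bounds, 0, 0)          # e.g., the first node must be on stage 0
pin_node(bounds, V - 1, S - 1)  # and the last node on the last stage
```

Fixing variables this way shrinks the branch-and-bound search space, which is where the speedup comes from.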

@kwen2501 (Contributor) left a review:

LGTM! Thanks much for pulling in the algorithm!
nit: can you fix the lint error? Thanks!

@spupyrev (Contributor, Author) commented on May 31, 2024:

New measurements on the latest revision:

| model | ngpus | manual | (old) autosplit | (new) graphsplit | algo time |
|---|---|---|---|---|---|
| pippy_bert | 2 | 0.0876 | 0.1030 | 0.0851 | 0.2 sec |
| pippy_bert | 4 | 0.0513 | 0.0590 | 0.0495 | 1.0 sec |
| pippy_gpt2 | 2 | 0.1031 | 0.1316 | 0.0997 | 0.5 sec |
| pippy_gpt2 | 4 | 0.0572 | 0.0820 | 0.0547 | 2.7 sec |
| pippy_fnet | 2 | 0.0661 | 0.0938 | 0.0664 | 0.2 sec |
| pippy_fnet | 4 | 0.0359 | crash | 0.0351 | 0.8 sec |
| pippy_blenderbot | 2 | 0.4853 | 0.5014 | 0.4863 | 0.6 sec |
| pippy_blenderbot | 4 | 0.2470 | 0.2630 | 0.2479 | 3.1 sec |
| pippy_electra | 2 | 0.1085 | 0.1385 | 0.1028 | 0.2 sec |
| pippy_electra | 4 | 0.0607 | crash | 0.0544 | 1.1 sec |

The last column shows the runtime of the splitting algorithm, that is, the overhead of the approach over manual/autosplit policies. The delay is (roughly) proportional to the size of the computation graph; in this benchmark, the graphs contain between 1K and 4K nodes.

@spupyrev merged commit 5e1d719 into pytorch:main on May 31, 2024; 6 checks passed.