
[torch.compile] fast inductor #11108

Merged (40 commits) into vllm-project:main on Dec 17, 2024
Conversation

youkaichao (Member) commented Dec 11, 2024:

Directly bypass aot-autograd and Inductor, and load from the cache.
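
At a high level, the idea is a persistent map from a compilation key to the compiled artifact, consulted before aot-autograd and Inductor are invoked; on a warm start the expensive compilation path is skipped entirely. A minimal sketch of that flow (CompileCache, compile_fn, and load_fn are illustrative names, not the actual vLLM API):

import os
import pickle
from typing import Any, Callable, Dict

class CompileCache:
    """Persistent map: compilation key -> handle of a compiled artifact."""

    def __init__(self, path: str):
        self.path = path
        self.entries: Dict[Any, Any] = {}
        if os.path.exists(path):
            with open(path, "rb") as f:
                self.entries = pickle.load(f)

    def save(self) -> None:
        with open(self.path, "wb") as f:
            pickle.dump(self.entries, f)

def compile_or_load(graph, key, cache: CompileCache,
                    compile_fn: Callable, load_fn: Callable):
    if key in cache.entries:
        # warm start: bypass aot-autograd/Inductor, load the cached artifact
        return load_fn(cache.entries[key])
    # cold start: compile once and record a handle for future runs
    compiled, handle = compile_fn(graph)
    cache.entries[key] = handle
    cache.save()
    return compiled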


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

vllm/config.py Outdated
@@ -2212,6 +2215,53 @@ class CompilationLevel:
PIECEWISE = 3


class InductorHashCache:
youkaichao (Member Author):

I tried to place this class into vllm.compilation.backends, but then it needs to be lazily imported, and pydantic complains.

Contributor:

Why not put into a separate file?
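
For context, a minimal sketch of what a class like InductorHashCache can boil down to: a small dict persisted to a file in the cache directory, mapping a compilation key to the Inductor hash string. The key layout (runtime shape plus piecewise graph index) and the literal-dict file format below are assumptions for illustration, not necessarily what this PR implements.

import ast
import os
from typing import Dict, Optional, Tuple

class InductorHashCache:
    """Maps (runtime_shape, graph_index) -> Inductor hash string, on disk."""

    def __init__(self, cache_file_path: str):
        self.cache_file_path = cache_file_path
        self.cache: Dict[Tuple[Optional[int], int], str] = {}
        if os.path.exists(cache_file_path):
            with open(cache_file_path) as f:
                # stored as a Python literal so the file stays human-readable
                self.cache = ast.literal_eval(f.read())

    def __contains__(self, key: Tuple[Optional[int], int]) -> bool:
        return key in self.cache

    def __getitem__(self, key: Tuple[Optional[int], int]) -> str:
        return self.cache[key]

    def __setitem__(self, key: Tuple[Optional[int], int], value: str) -> None:
        self.cache[key] = value

    def save_to_file(self) -> None:
        with open(self.cache_file_path, "w") as f:
            f.write(repr(self.cache))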

youkaichao (Member Author):

For torch.compile with warm start (i.e. we run it once to warm up the compilation cache, and then reuse the compilation cache):

Before this PR (main branch):

$ vllm serve meta-llama/Meta-Llama-3-8B --disable-log-requests -O "{'level': 3, 'candidate_compile_sizes': [1, 2]}"
Dynamo bytecode transform time: 4.62 s
Compiling a graph for general shape takes 14.72 s
Compiling a graph for shape 2 takes 14.63 s
Compiling a graph for shape 1 takes 11.01 s
torch.compile takes 44.98 s in total

This PR:

$ vllm serve meta-llama/Meta-Llama-3-8B --disable-log-requests -O "{'level': 3, 'candidate_compile_sizes': [1, 2]}"
Dynamo bytecode transform time: 4.58 s
Compiling a graph for general shape takes 2.76 s
Compiling a graph for shape 2 takes 0.44 s
Compiling a graph for shape 1 takes 1.50 s
torch.compile takes 9.29 s in total

This should be close to optimal now.

@youkaichao youkaichao marked this pull request as ready for review December 11, 2024 21:32
youkaichao (Member Author) commented Dec 11, 2024:

Now, even if we compile for all sizes, the compilation time becomes negligible:

$ vllm serve meta-llama/Meta-Llama-3-8B --disable-log-requests -O "{'level': 3, 'candidate_compile_sizes': [$(seq -s, 1 1 256)]}"
INFO 12-11 14:04:25 backends.py:363] Dynamo bytecode transform time: 4.59 s
INFO 12-11 14:04:28 backends.py:155] Compiling a graph for general shape takes 2.77 s
INFO 12-11 14:04:32 backends.py:158] Compiling a graph for shape 256 takes 0.48 s
INFO 12-11 14:04:33 backends.py:158] Compiling a graph for shape 248 takes 0.48 s
INFO 12-11 14:04:34 backends.py:158] Compiling a graph for shape 240 takes 0.49 s
INFO 12-11 14:04:35 backends.py:158] Compiling a graph for shape 232 takes 0.37 s
INFO 12-11 14:04:36 backends.py:158] Compiling a graph for shape 224 takes 0.38 s
INFO 12-11 14:04:37 backends.py:158] Compiling a graph for shape 216 takes 0.47 s
INFO 12-11 14:04:37 backends.py:158] Compiling a graph for shape 208 takes 0.32 s
INFO 12-11 14:04:38 backends.py:158] Compiling a graph for shape 200 takes 0.47 s
INFO 12-11 14:04:39 backends.py:158] Compiling a graph for shape 192 takes 0.39 s
INFO 12-11 14:04:39 backends.py:158] Compiling a graph for shape 184 takes 0.34 s
INFO 12-11 14:04:40 backends.py:158] Compiling a graph for shape 176 takes 0.36 s
INFO 12-11 14:04:41 backends.py:158] Compiling a graph for shape 168 takes 0.49 s
INFO 12-11 14:04:42 backends.py:158] Compiling a graph for shape 160 takes 0.52 s
INFO 12-11 14:04:43 backends.py:158] Compiling a graph for shape 152 takes 0.47 s
INFO 12-11 14:04:44 backends.py:158] Compiling a graph for shape 144 takes 0.41 s
INFO 12-11 14:04:44 backends.py:158] Compiling a graph for shape 136 takes 0.33 s
INFO 12-11 14:04:45 backends.py:158] Compiling a graph for shape 128 takes 0.53 s
INFO 12-11 14:04:46 backends.py:158] Compiling a graph for shape 120 takes 0.33 s
INFO 12-11 14:04:47 backends.py:158] Compiling a graph for shape 112 takes 0.48 s
INFO 12-11 14:04:48 backends.py:158] Compiling a graph for shape 104 takes 0.51 s
INFO 12-11 14:04:48 backends.py:158] Compiling a graph for shape 96 takes 0.53 s
INFO 12-11 14:04:49 backends.py:158] Compiling a graph for shape 88 takes 0.54 s
INFO 12-11 14:04:50 backends.py:158] Compiling a graph for shape 80 takes 0.52 s
INFO 12-11 14:04:51 backends.py:158] Compiling a graph for shape 72 takes 0.59 s
INFO 12-11 14:04:52 backends.py:158] Compiling a graph for shape 64 takes 0.57 s
INFO 12-11 14:04:53 backends.py:158] Compiling a graph for shape 56 takes 0.51 s
INFO 12-11 14:04:54 backends.py:158] Compiling a graph for shape 48 takes 0.42 s
INFO 12-11 14:04:55 backends.py:158] Compiling a graph for shape 40 takes 0.56 s
INFO 12-11 14:04:56 backends.py:158] Compiling a graph for shape 32 takes 0.44 s
INFO 12-11 14:04:57 backends.py:158] Compiling a graph for shape 24 takes 0.47 s
INFO 12-11 14:04:57 backends.py:158] Compiling a graph for shape 16 takes 0.47 s
INFO 12-11 14:04:58 backends.py:158] Compiling a graph for shape 8 takes 0.47 s
INFO 12-11 14:04:59 backends.py:158] Compiling a graph for shape 4 takes 0.52 s
INFO 12-11 14:05:00 backends.py:158] Compiling a graph for shape 2 takes 0.51 s
INFO 12-11 14:05:02 backends.py:158] Compiling a graph for shape 1 takes 1.43 s
INFO 12-11 14:05:02 monitor.py:31] torch.compile takes 24.54 s in total

Now that we can directly cache inductor compilation, we don't need to use piecewise compilation anymore:

vllm serve meta-llama/Meta-Llama-3-8B --disable-log-requests -O "{'level': 3, 'candidate_compile_sizes': [$(seq -s, 1 1 256)], 'splitting_ops': []}"
INFO 12-11 22:13:19 backends.py:371] Dynamo bytecode transform time: 4.81 s
INFO 12-11 22:13:22 backends.py:163] Compiling a graph for general shape takes 0.68 s
INFO 12-11 22:13:26 backends.py:166] Compiling a graph for shape 256 takes 0.41 s
INFO 12-11 22:13:27 backends.py:166] Compiling a graph for shape 248 takes 0.41 s
INFO 12-11 22:13:28 backends.py:166] Compiling a graph for shape 240 takes 0.48 s
INFO 12-11 22:13:29 backends.py:166] Compiling a graph for shape 232 takes 0.46 s
INFO 12-11 22:13:30 backends.py:166] Compiling a graph for shape 224 takes 0.47 s
INFO 12-11 22:13:31 backends.py:166] Compiling a graph for shape 216 takes 0.39 s
INFO 12-11 22:13:32 backends.py:166] Compiling a graph for shape 208 takes 0.44 s
INFO 12-11 22:13:32 backends.py:166] Compiling a graph for shape 200 takes 0.36 s
INFO 12-11 22:13:33 backends.py:166] Compiling a graph for shape 192 takes 0.41 s
INFO 12-11 22:13:34 backends.py:166] Compiling a graph for shape 184 takes 0.40 s
INFO 12-11 22:13:35 backends.py:166] Compiling a graph for shape 176 takes 0.46 s
INFO 12-11 22:13:36 backends.py:166] Compiling a graph for shape 168 takes 0.36 s
INFO 12-11 22:13:36 backends.py:166] Compiling a graph for shape 160 takes 0.30 s
INFO 12-11 22:13:37 backends.py:166] Compiling a graph for shape 152 takes 0.41 s
INFO 12-11 22:13:38 backends.py:166] Compiling a graph for shape 144 takes 0.50 s
INFO 12-11 22:13:39 backends.py:166] Compiling a graph for shape 136 takes 0.45 s
INFO 12-11 22:13:40 backends.py:166] Compiling a graph for shape 128 takes 0.47 s
INFO 12-11 22:13:41 backends.py:166] Compiling a graph for shape 120 takes 0.47 s
INFO 12-11 22:13:41 backends.py:166] Compiling a graph for shape 112 takes 0.43 s
INFO 12-11 22:13:42 backends.py:166] Compiling a graph for shape 104 takes 0.51 s
INFO 12-11 22:13:43 backends.py:166] Compiling a graph for shape 96 takes 0.57 s
INFO 12-11 22:13:44 backends.py:166] Compiling a graph for shape 88 takes 0.42 s
INFO 12-11 22:13:45 backends.py:166] Compiling a graph for shape 80 takes 0.44 s
INFO 12-11 22:13:46 backends.py:166] Compiling a graph for shape 72 takes 0.56 s
INFO 12-11 22:13:47 backends.py:166] Compiling a graph for shape 64 takes 0.39 s
INFO 12-11 22:13:48 backends.py:166] Compiling a graph for shape 56 takes 0.41 s
INFO 12-11 22:13:49 backends.py:166] Compiling a graph for shape 48 takes 0.39 s
INFO 12-11 22:13:50 backends.py:166] Compiling a graph for shape 40 takes 0.50 s
INFO 12-11 22:13:50 backends.py:166] Compiling a graph for shape 32 takes 0.43 s
INFO 12-11 22:13:51 backends.py:166] Compiling a graph for shape 24 takes 0.53 s
INFO 12-11 22:13:52 backends.py:166] Compiling a graph for shape 16 takes 0.34 s
INFO 12-11 22:13:53 backends.py:166] Compiling a graph for shape 8 takes 0.52 s
INFO 12-11 22:13:54 backends.py:166] Compiling a graph for shape 4 takes 0.52 s
INFO 12-11 22:13:55 backends.py:166] Compiling a graph for shape 2 takes 0.45 s
INFO 12-11 22:13:56 backends.py:166] Compiling a graph for shape 1 takes 0.68 s
INFO 12-11 22:13:56 monitor.py:31] torch.compile takes 21.25 s in total

Therefore, I decided to remove piecewise compile for v0, while still keeping piecewise compile for v1. cc @bnellnm

youkaichao (Member Author):

cc @ProExpertProg @bnellnm

youkaichao (Member Author):

Throughput benchmark:

# run the baseline first to find the number of scheduler steps that keeps GPUs busy
$ python benchmarks/benchmark_throughput.py --dataset ShareGPT_V3_unfiltered_cleaned_split.json --model meta-llama/Meta-Llama-3-8B
Throughput: 30.52 requests/s, 12620.07 total tokens/s, 6052.88 output tokens/s

$ python benchmarks/benchmark_throughput.py --dataset ShareGPT_V3_unfiltered_cleaned_split.json --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 8
Throughput: 43.57 requests/s, 18018.38 total tokens/s, 8642.04 output tokens/s

$ python benchmarks/benchmark_throughput.py --dataset ShareGPT_V3_unfiltered_cleaned_split.json --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 10
Throughput: 44.04 requests/s, 18212.79 total tokens/s, 8735.28 output tokens/s

$ python benchmarks/benchmark_throughput.py --dataset ShareGPT_V3_unfiltered_cleaned_split.json --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 12
Throughput: 44.65 requests/s, 18464.38 total tokens/s, 8855.95 output tokens/s

$ python benchmarks/benchmark_throughput.py --dataset ShareGPT_V3_unfiltered_cleaned_split.json --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 16
Throughput: 45.03 requests/s, 18622.54 total tokens/s, 8931.81 output tokens/s

$ python benchmarks/benchmark_throughput.py --dataset ShareGPT_V3_unfiltered_cleaned_split.json --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 20
Throughput: 44.18 requests/s, 18269.12 total tokens/s, 8762.30 output tokens/s

# the best number of scheduler steps is 16; run this setting with `torch.compile`
python benchmarks/benchmark_throughput.py --dataset ShareGPT_V3_unfiltered_cleaned_split.json --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 16 -O "{'level': 3, 'candidate_compile_sizes': [$(seq -s, 1 1 256)]}"
Throughput: 46.93 requests/s, 19405.53 total tokens/s, 9307.35 output tokens/s

Now we don't need to profile the batch-size distribution anymore; we can just compile for all sizes.

youkaichao (Member Author):

I think there might be something wrong inside Inductor when I try the Llama 3 70B model (with TP 4). Even when I directly cache the graph hash, every shape takes 90 seconds to load the graph.

I will stop here, as the results on Llama 3 8B are quite good and compilation time is greatly reduced. The Llama 3 70B case is likely a bug to fix in the future.

vllm/config.py Outdated
if self.model_config is not None and \
not self.compilation_config.cache_dir:
# generate a cache directory based on the model information
# TODO: consider more factors that will affect model forward,
youkaichao (Member Author):

I think I missed some quantization args that can affect model execution, but I don't know how to pull out all factors that affect quantization.

Collaborator:

vLLM version? We can add the git SHA to the key

Collaborator:

This is going to be a large source of potential bugs, so we should definitely be careful here. Most quantization-related stuff from NM goes in the model_config, but there are a lot of arguments to LLM that can affect things like dtype and quantization. Are these in the key already?

youkaichao (Member Author):

Not yet; that's why I want to ask for reviews.

One direction is to consider all factors affecting compilation, so we can use the compilation cache by default.

Another approach is to not cache by default, but tell the user the cache directory; users can then specify the cache directory if they know nothing has changed.

Which one would you prefer?

Contributor:

I think we should always check the known factors when we cache, and expose an accessible switch for enabling/disabling caching. Then it's less important whether it's on by default or not; for that decision, @robertgshaw2-neuralmagic should chime in.

youkaichao (Member Author):

I added more factors to consider in 2a7f729. Let me know if I missed anything.
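
For illustration, the direction discussed above amounts to hashing every factor known to affect the compiled graph and deriving the cache directory from that hash. The factor list and path layout below are assumptions (the actual set lives in vllm/config.py), but they show the shape of the idea:

import hashlib
import os
from typing import Optional

def compute_cache_dir(model_name: str, dtype: str,
                      quantization: Optional[str],
                      vllm_version_or_sha: str,
                      root: str = "~/.cache/vllm") -> str:
    # If any factor that changes model execution changes, the hash changes,
    # so we land in a fresh directory instead of reusing stale artifacts.
    factors = [model_name, dtype, quantization, vllm_version_or_sha]
    key = hashlib.md5(str(factors).encode()).hexdigest()[:10]
    return os.path.expanduser(
        os.path.join(root, "torch_compile_cache", key))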

vllm/config.py Outdated
self.cache_dir = os.path.join(
self.cache_dir, f"rank_{vllm_config.parallel_config.rank}")
os.makedirs(self.cache_dir, exist_ok=True)
self.inductor_hash_cache_path = os.path.join(self.cache_dir,
youkaichao (Member Author):

It would be better if we also saved a serialized form of the config, but we need to design the serialization format.

Contributor:

Which config is not serializable? Isn't CompilationConfig serializable?

youkaichao (Member Author):

It is serializable, but I want a human-readable form, so that we can also manually inspect the config.
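
One lightweight way to get such a human-readable form (a sketch of the idea, not what this PR ships) is to dump the compilation config as pretty-printed JSON next to the hash cache, so the cache directory can be inspected manually:

import json
import os

def dump_config_for_inspection(cache_dir: str, config_dict: dict) -> None:
    # a pretty-printed copy of the config next to the cached artifacts,
    # purely for humans to eyeball what the cache was built with
    path = os.path.join(cache_dir, "compilation_config.json")
    with open(path, "w") as f:
        json.dump(config_dict, f, indent=2, sort_keys=True, default=str)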

ProExpertProg (Contributor) left a comment:

Two main notes:

  • Figuring out what the "key" is for the cache should be a method on config, likely on each sub-config as well. That way it's clear that when developers modify a config, they might need to modify the key as well (a rough sketch follows this list).
  • I think InductorHashCache is a nice abstraction, but it shouldn't live inside config (both file- and structure-wise). It should also take on more responsibilities; I think we can make it cleaner, so that wrap_inductor doesn't contain caching logic and just calls the appropriate methods on the cache (or cache manager) object.
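
A rough sketch of that layout, with illustrative names only: each sub-config hashes the fields that affect compilation, and the top-level config concatenates the pieces into the cache key.

import hashlib

class ModelConfigSketch:
    def __init__(self, model: str, dtype: str):
        self.model, self.dtype = model, dtype

    def compute_hash(self) -> str:
        # each sub-config decides which of its fields affect compilation
        return hashlib.sha256(f"{self.model}:{self.dtype}".encode()).hexdigest()

class VllmConfigSketch:
    def __init__(self, model_config: ModelConfigSketch, vllm_sha: str):
        self.model_config = model_config
        self.vllm_sha = vllm_sha

    def compute_hash(self) -> str:
        parts = [self.vllm_sha, self.model_config.compute_hash()]
        return hashlib.sha256(":".join(parts).encode()).hexdigest()[:10]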

@@ -71,6 +71,7 @@
VLLM_USE_V1: bool = False
VLLM_ENABLE_V1_MULTIPROCESSING: bool = False
VLLM_LOG_BATCHSIZE_INTERVAL: float = -1
VLLM_DISABLE_COMPILE_CACHE: bool = False
youkaichao (Member Author):

Added the flag to disable the compile cache; the compile cache is used by default.
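
Usage is just setting the environment variable before launching; assuming the usual convention that a truthy value such as 1 enables the flag, a cold (uncached) run looks like:

# disable the compile cache for one run, e.g. while debugging a compilation pass
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Meta-Llama-3-8B -O "{'level': 3}"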

Comment on lines +51 to +59
# set flags so that Inductor and Triton store their cache
# in the cache_dir, then users only need to copy the cache_dir
# to another machine to reuse the cache.
inductor_cache = os.path.join(cache_dir, "inductor_cache")
os.makedirs(inductor_cache, exist_ok=True)
os.environ["TORCHINDUCTOR_CACHE_DIR"] = inductor_cache
triton_cache = os.path.join(cache_dir, "triton_cache")
os.makedirs(triton_cache, exist_ok=True)
os.environ["TRITON_CACHE_DIR"] = triton_cache
youkaichao (Member Author):

Redirect the Inductor/Triton caches to the vLLM cache location.
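
Since both caches now live under the vLLM cache directory, reusing them on another machine is a plain copy. The path below is illustrative (the actual directory is derived from the config hash as discussed above):

# copy the whole compile cache to a machine with the same GPU/model/vLLM setup
rsync -a ~/.cache/vllm/torch_compile_cache/ other-host:~/.cache/vllm/torch_compile_cache/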

youkaichao (Member Author):

In the future, I plan to dump more information in the cache directory, including:

  • the transformed bytecode from Dynamo
  • the human-readable fx graph after every transform
  • the human-readable fx graph that inductor compiles
  • the human-readable executable python file inductor finally run

The goal is to make torch.compile crystal clear: we can know the exact correspondence between the compiled code and our original code, so that we can easily debug any potential correctness or performance issues.

ProExpertProg (Contributor) left a comment:

Agree with Tyler's comments about config caching, otherwise LGTM!

Comment on lines +2610 to +2611
factors.append(self.inductor_compile_config)
factors.append(self.inductor_passes)
Collaborator:

What about candidate_compile_sizes?

youkaichao (Member Author):

They do not affect the computation graph.

Say we compile for candidate_compile_sizes = [1, 2, 4], and then run again with candidate_compile_sizes = [1, 2, 4, 8]. We want them to share the same cache directory, so that we can directly load the cache for [1, 2, 4] and only compile for the new shape 8.
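
A tiny sketch of why this works, assuming the hash cache is keyed per runtime shape (key layout illustrative): a second run with a superset of sizes hits the cache for the shapes it has already seen and only compiles the new one.

# first run: candidate_compile_sizes = [1, 2, 4] populates the cache
cache = {}  # runtime_shape -> compiled-artifact handle, persisted between runs
for shape in [1, 2, 4]:
    cache.setdefault(shape, f"compiled_for_{shape}")  # compile on cache miss

# second run: [1, 2, 4, 8] reuses 1, 2, 4 and only compiles shape 8
to_compile = [s for s in [1, 2, 4, 8] if s not in cache]
assert to_compile == [8]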


mergify bot commented Dec 14, 2024

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @youkaichao.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 14, 2024
tlrmchlsmth (Collaborator) left a comment:

Works for me now! Some suggestions on comments and logging, but LGTM otherwise

@youkaichao youkaichao enabled auto-merge (squash) December 16, 2024 22:42
github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Dec 16, 2024
@youkaichao youkaichao disabled auto-merge December 17, 2024 00:15
@youkaichao youkaichao merged commit 88a412e into vllm-project:main Dec 17, 2024
50 of 53 checks passed
@youkaichao youkaichao deleted the fast_inductor branch December 17, 2024 00:15
BKitor pushed a commit to BKitor/vllm that referenced this pull request Dec 30, 2024
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)

4 participants