[torch.compile] fast inductor #11108
Conversation
Signed-off-by: youkaichao <[email protected]>
vllm/config.py (outdated)

@@ -2212,6 +2215,53 @@ class CompilationLevel:
    PIECEWISE = 3


class InductorHashCache:
I tried to place this class into vllm.compilation.backends, but then it needs to be lazily imported, and pydantic will complain.
Why not put it into a separate file?
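For context, a minimal sketch of what such a hash cache could look like, assuming it simply persists a mapping from (runtime shape, graph index) to an Inductor hash string in the cache directory. The class and file names here are illustrative, not the exact vLLM implementation:

import ast
import os


class InductorHashCacheSketch:
    # Illustrative stand-in: stores the mapping as the repr of a Python dict,
    # so the on-disk file stays human-readable.

    def __init__(self, cache_dir, disabled=False):
        self.disabled = disabled
        self.cache_file_path = os.path.join(cache_dir, "inductor_hash_cache.py")
        self.cache = {}  # (runtime_shape, graph_index) -> inductor hash string
        if not disabled and os.path.exists(self.cache_file_path):
            with open(self.cache_file_path) as f:
                self.cache = ast.literal_eval(f.read())

    def save_to_file(self):
        if self.disabled:
            return
        with open(self.cache_file_path, "w") as f:
            f.write(repr(self.cache))

    def __contains__(self, key):
        return not self.disabled and key in self.cache

    def __getitem__(self, key):
        return self.cache[key]

    def __setitem__(self, key, value):
        if not self.disabled:
            self.cache[key] = value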
Before this PR (main branch):

$ vllm serve meta-llama/Meta-Llama-3-8B --disable-log-requests -O "{'level': 3, 'candidate_compile_sizes': [1, 2]}"
Dynamo bytecode transform time: 4.62 s
Compiling a graph for general shape takes 14.72 s
Compiling a graph for shape 2 takes 14.63 s
Compiling a graph for shape 1 takes 11.01 s
torch.compile takes 44.98 s in total

With this PR:

$ vllm serve meta-llama/Meta-Llama-3-8B --disable-log-requests -O "{'level': 3, 'candidate_compile_sizes': [1, 2]}"
Dynamo bytecode transform time: 4.58 s
Compiling a graph for general shape takes 2.76 s
Compiling a graph for shape 2 takes 0.44 s
Compiling a graph for shape 1 takes 1.50 s
torch.compile takes 9.29 s in total

This should be close to optimal now.
Signed-off-by: youkaichao <[email protected]>
Now, even if we compile for all sizes, the compilation time becomes negligible:

$ vllm serve meta-llama/Meta-Llama-3-8B --disable-log-requests -O "{'level': 3, 'candidate_compile_sizes': [$(seq -s, 1 1 256)]}"
INFO 12-11 14:04:25 backends.py:363] Dynamo bytecode transform time: 4.59 s
INFO 12-11 14:04:28 backends.py:155] Compiling a graph for general shape takes 2.77 s
INFO 12-11 14:04:32 backends.py:158] Compiling a graph for shape 256 takes 0.48 s
INFO 12-11 14:04:33 backends.py:158] Compiling a graph for shape 248 takes 0.48 s
INFO 12-11 14:04:34 backends.py:158] Compiling a graph for shape 240 takes 0.49 s
INFO 12-11 14:04:35 backends.py:158] Compiling a graph for shape 232 takes 0.37 s
INFO 12-11 14:04:36 backends.py:158] Compiling a graph for shape 224 takes 0.38 s
INFO 12-11 14:04:37 backends.py:158] Compiling a graph for shape 216 takes 0.47 s
INFO 12-11 14:04:37 backends.py:158] Compiling a graph for shape 208 takes 0.32 s
INFO 12-11 14:04:38 backends.py:158] Compiling a graph for shape 200 takes 0.47 s
INFO 12-11 14:04:39 backends.py:158] Compiling a graph for shape 192 takes 0.39 s
INFO 12-11 14:04:39 backends.py:158] Compiling a graph for shape 184 takes 0.34 s
INFO 12-11 14:04:40 backends.py:158] Compiling a graph for shape 176 takes 0.36 s
INFO 12-11 14:04:41 backends.py:158] Compiling a graph for shape 168 takes 0.49 s
INFO 12-11 14:04:42 backends.py:158] Compiling a graph for shape 160 takes 0.52 s
INFO 12-11 14:04:43 backends.py:158] Compiling a graph for shape 152 takes 0.47 s
INFO 12-11 14:04:44 backends.py:158] Compiling a graph for shape 144 takes 0.41 s
INFO 12-11 14:04:44 backends.py:158] Compiling a graph for shape 136 takes 0.33 s
INFO 12-11 14:04:45 backends.py:158] Compiling a graph for shape 128 takes 0.53 s
INFO 12-11 14:04:46 backends.py:158] Compiling a graph for shape 120 takes 0.33 s
INFO 12-11 14:04:47 backends.py:158] Compiling a graph for shape 112 takes 0.48 s
INFO 12-11 14:04:48 backends.py:158] Compiling a graph for shape 104 takes 0.51 s
INFO 12-11 14:04:48 backends.py:158] Compiling a graph for shape 96 takes 0.53 s
INFO 12-11 14:04:49 backends.py:158] Compiling a graph for shape 88 takes 0.54 s
INFO 12-11 14:04:50 backends.py:158] Compiling a graph for shape 80 takes 0.52 s
INFO 12-11 14:04:51 backends.py:158] Compiling a graph for shape 72 takes 0.59 s
INFO 12-11 14:04:52 backends.py:158] Compiling a graph for shape 64 takes 0.57 s
INFO 12-11 14:04:53 backends.py:158] Compiling a graph for shape 56 takes 0.51 s
INFO 12-11 14:04:54 backends.py:158] Compiling a graph for shape 48 takes 0.42 s
INFO 12-11 14:04:55 backends.py:158] Compiling a graph for shape 40 takes 0.56 s
INFO 12-11 14:04:56 backends.py:158] Compiling a graph for shape 32 takes 0.44 s
INFO 12-11 14:04:57 backends.py:158] Compiling a graph for shape 24 takes 0.47 s
INFO 12-11 14:04:57 backends.py:158] Compiling a graph for shape 16 takes 0.47 s
INFO 12-11 14:04:58 backends.py:158] Compiling a graph for shape 8 takes 0.47 s
INFO 12-11 14:04:59 backends.py:158] Compiling a graph for shape 4 takes 0.52 s
INFO 12-11 14:05:00 backends.py:158] Compiling a graph for shape 2 takes 0.51 s
INFO 12-11 14:05:02 backends.py:158] Compiling a graph for shape 1 takes 1.43 s
INFO 12-11 14:05:02 monitor.py:31] torch.compile takes 24.54 s in total

Now that we can directly cache Inductor compilation, we don't need to use piecewise compilation anymore:

$ vllm serve meta-llama/Meta-Llama-3-8B --disable-log-requests -O "{'level': 3, 'candidate_compile_sizes': [$(seq -s, 1 1 256)], 'splitting_ops': []}"
INFO 12-11 22:13:19 backends.py:371] Dynamo bytecode transform time: 4.81 s
INFO 12-11 22:13:22 backends.py:163] Compiling a graph for general shape takes 0.68 s
INFO 12-11 22:13:26 backends.py:166] Compiling a graph for shape 256 takes 0.41 s
INFO 12-11 22:13:27 backends.py:166] Compiling a graph for shape 248 takes 0.41 s
INFO 12-11 22:13:28 backends.py:166] Compiling a graph for shape 240 takes 0.48 s
INFO 12-11 22:13:29 backends.py:166] Compiling a graph for shape 232 takes 0.46 s
INFO 12-11 22:13:30 backends.py:166] Compiling a graph for shape 224 takes 0.47 s
INFO 12-11 22:13:31 backends.py:166] Compiling a graph for shape 216 takes 0.39 s
INFO 12-11 22:13:32 backends.py:166] Compiling a graph for shape 208 takes 0.44 s
INFO 12-11 22:13:32 backends.py:166] Compiling a graph for shape 200 takes 0.36 s
INFO 12-11 22:13:33 backends.py:166] Compiling a graph for shape 192 takes 0.41 s
INFO 12-11 22:13:34 backends.py:166] Compiling a graph for shape 184 takes 0.40 s
INFO 12-11 22:13:35 backends.py:166] Compiling a graph for shape 176 takes 0.46 s
INFO 12-11 22:13:36 backends.py:166] Compiling a graph for shape 168 takes 0.36 s
INFO 12-11 22:13:36 backends.py:166] Compiling a graph for shape 160 takes 0.30 s
INFO 12-11 22:13:37 backends.py:166] Compiling a graph for shape 152 takes 0.41 s
INFO 12-11 22:13:38 backends.py:166] Compiling a graph for shape 144 takes 0.50 s
INFO 12-11 22:13:39 backends.py:166] Compiling a graph for shape 136 takes 0.45 s
INFO 12-11 22:13:40 backends.py:166] Compiling a graph for shape 128 takes 0.47 s
INFO 12-11 22:13:41 backends.py:166] Compiling a graph for shape 120 takes 0.47 s
INFO 12-11 22:13:41 backends.py:166] Compiling a graph for shape 112 takes 0.43 s
INFO 12-11 22:13:42 backends.py:166] Compiling a graph for shape 104 takes 0.51 s
INFO 12-11 22:13:43 backends.py:166] Compiling a graph for shape 96 takes 0.57 s
INFO 12-11 22:13:44 backends.py:166] Compiling a graph for shape 88 takes 0.42 s
INFO 12-11 22:13:45 backends.py:166] Compiling a graph for shape 80 takes 0.44 s
INFO 12-11 22:13:46 backends.py:166] Compiling a graph for shape 72 takes 0.56 s
INFO 12-11 22:13:47 backends.py:166] Compiling a graph for shape 64 takes 0.39 s
INFO 12-11 22:13:48 backends.py:166] Compiling a graph for shape 56 takes 0.41 s
INFO 12-11 22:13:49 backends.py:166] Compiling a graph for shape 48 takes 0.39 s
INFO 12-11 22:13:50 backends.py:166] Compiling a graph for shape 40 takes 0.50 s
INFO 12-11 22:13:50 backends.py:166] Compiling a graph for shape 32 takes 0.43 s
INFO 12-11 22:13:51 backends.py:166] Compiling a graph for shape 24 takes 0.53 s
INFO 12-11 22:13:52 backends.py:166] Compiling a graph for shape 16 takes 0.34 s
INFO 12-11 22:13:53 backends.py:166] Compiling a graph for shape 8 takes 0.52 s
INFO 12-11 22:13:54 backends.py:166] Compiling a graph for shape 4 takes 0.52 s
INFO 12-11 22:13:55 backends.py:166] Compiling a graph for shape 2 takes 0.45 s
INFO 12-11 22:13:56 backends.py:166] Compiling a graph for shape 1 takes 0.68 s
INFO 12-11 22:13:56 monitor.py:31] torch.compile takes 21.25 s in total

Therefore, I decided to remove piecewise compile for v0, while still keeping piecewise compile for v1. cc @bnellnm
Throughput benchmark:

# run baseline first, to find out the number of scheduler steps to keep gpus busy
$ python benchmarks/benchmark_throughput.py --dataset ShareGPT_V3_unfiltered_cleaned_split.json --model meta-llama/Meta-Llama-3-8B
Throughput: 30.52 requests/s, 12620.07 total tokens/s, 6052.88 output tokens/s
$ python benchmarks/benchmark_throughput.py --dataset ShareGPT_V3_unfiltered_cleaned_split.json --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 8
Throughput: 43.57 requests/s, 18018.38 total tokens/s, 8642.04 output tokens/s
$ python benchmarks/benchmark_throughput.py --dataset ShareGPT_V3_unfiltered_cleaned_split.json --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 10
Throughput: 44.04 requests/s, 18212.79 total tokens/s, 8735.28 output tokens/s
$ python benchmarks/benchmark_throughput.py --dataset ShareGPT_V3_unfiltered_cleaned_split.json --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 12
Throughput: 44.65 requests/s, 18464.38 total tokens/s, 8855.95 output tokens/s
$ python benchmarks/benchmark_throughput.py --dataset ShareGPT_V3_unfiltered_cleaned_split.json --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 16
Throughput: 45.03 requests/s, 18622.54 total tokens/s, 8931.81 output tokens/s
$ python benchmarks/benchmark_throughput.py --dataset ShareGPT_V3_unfiltered_cleaned_split.json --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 20
Throughput: 44.18 requests/s, 18269.12 total tokens/s, 8762.30 output tokens/s
# the best number of scheduler steps is 16; run this setting with `torch.compile`
python benchmarks/benchmark_throughput.py --dataset ShareGPT_V3_unfiltered_cleaned_split.json --model meta-llama/Meta-Llama-3-8B --num-scheduler-steps 16 -O "{'level': 3, 'candidate_compile_sizes': [$(seq -s, 1 1 256)]}"
Throughput: 46.93 requests/s, 19405.53 total tokens/s, 9307.35 output tokens/s

Now we don't need to profile the batchsize distribution anymore. We can just compile for all sizes.
Signed-off-by: youkaichao <[email protected]>
I think there might be something wrong inside Inductor when I try the Llama 3 70B model (with TP 4). Even when I directly cache the graph hash, every shape takes 90 seconds to load the graph. I will stop here, as the results on Llama 3 8B are quite good and the compilation time is greatly reduced. The Llama 3 70B case should be a bug to fix in the future.
vllm/config.py (outdated)

if self.model_config is not None and \
        not self.compilation_config.cache_dir:
    # generate a cache directory based on the model information
    # TODO: consider more factors that will affect model forward,
I think I missed some quantization args that can affect model execution, but I don't know how to pull out all factors that affect quantization.
vLLM version? We can add the git SHA to the key
This is going to be a large source of potential bugs, so we should definitely be careful here. Most quantization-related stuff from NM goes in the model_config, but there are a lot of arguments to LLM that can affect things, like dtype and quantization. Are these in the key already?
Not yet; that's why I want to ask for reviews.
One direction is to consider all factors affecting compilation, so that we can use the compilation cache by default.
Another approach is not to cache by default, but to tell the user the cache directory; users can then specify the cache directory if they know nothing has changed.
Which one would you prefer?
I think we should always check the known factors when we cache, and expose an accessible switch for enabling/disabling caching. And then it's less important whether it's on by default or not. And for that decision @robertgshaw2-neuralmagic should chime in.
I added more factors to consider in 2a7f729. Let me know if I missed anything.
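As a rough sketch of the direction discussed here (the factor list below is illustrative, not the exact set added in 2a7f729), the cache key can be derived by hashing everything that influences the compiled graph:

import hashlib


def compute_cache_key_sketch(model_config, parallel_config, compilation_config,
                             vllm_version):
    # collect every factor that can change the compiled artifact; anything
    # missing here is a potential source of stale-cache bugs
    factors = [
        vllm_version,                            # or a git SHA, as suggested above
        model_config.model,                      # model name / path
        str(model_config.dtype),                 # dtype affects kernel selection
        model_config.quantization,               # quantization method, if any
        parallel_config.tensor_parallel_size,
        compilation_config.level,
        compilation_config.splitting_ops,
    ]
    return hashlib.md5(str(factors).encode()).hexdigest()[:10]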
vllm/config.py (outdated)

self.cache_dir = os.path.join(
    self.cache_dir, f"rank_{vllm_config.parallel_config.rank}")
os.makedirs(self.cache_dir, exist_ok=True)
self.inductor_hash_cache_path = os.path.join(self.cache_dir,
It would be better if we also saved a serialized form of the config, but we need to design the serialization format.
Which config is not serializable? Isn't CompilationConfig serializable?
It is serializable, but I want a human-readable form, so that we can also manually check the config.
Two main notes:
- Figuring out what the "key" is for the cache should be a method on the config, likely on each sub-config as well. That way it's clear that developers who modify a config may need to modify the key as well.
- I think InductorHashCache is a nice abstraction, but it shouldn't live inside config (both file- and structure-wise). It should also take on more responsibilities. I think we can make it cleaner, so that wrap_inductor doesn't contain caching logic - it just calls the appropriate methods on the cache (or cache manager) object.
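A rough sketch of the shape being suggested here, where all caching decisions live on a cache (or cache manager) object and the compile wrapper only calls into it; the function and method names below are hypothetical, not the current code:

def wrap_inductor_sketch(graph, example_inputs, runtime_shape, graph_index,
                         hash_cache, compile_fn):
    # `hash_cache` is any object exposing load/store keyed on
    # (runtime_shape, graph_index); `compile_fn` does the actual Inductor
    # compilation and returns a callable.
    key = (runtime_shape, graph_index)
    cached = hash_cache.load(key)        # returns a compiled callable, or None on miss
    if cached is not None:
        return cached
    compiled = compile_fn(graph, example_inputs)
    hash_cache.store(key, compiled)      # persist the entry for the next run
    return compiled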
Signed-off-by: youkaichao <[email protected]>
@@ -71,6 +71,7 @@
VLLM_USE_V1: bool = False
VLLM_ENABLE_V1_MULTIPROCESSING: bool = False
VLLM_LOG_BATCHSIZE_INTERVAL: float = -1
VLLM_DISABLE_COMPILE_CACHE: bool = False
Added the flag to disable the compile cache. The compile cache is used by default.
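A sketch of how the flag could gate caching; only the VLLM_DISABLE_COMPILE_CACHE environment variable comes from this PR, the helper itself is hypothetical:

import vllm.envs as envs


def cache_enabled():
    # caching is on by default; set VLLM_DISABLE_COMPILE_CACHE=1 to opt out,
    # e.g. when debugging the compilation itself
    return not envs.VLLM_DISABLE_COMPILE_CACHE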
# set flags so that Inductor and Triton store their cache
# in the cache_dir, then users only need to copy the cache_dir
# to another machine to reuse the cache.
inductor_cache = os.path.join(cache_dir, "inductor_cache")
os.makedirs(inductor_cache, exist_ok=True)
os.environ["TORCHINDUCTOR_CACHE_DIR"] = inductor_cache
triton_cache = os.path.join(cache_dir, "triton_cache")
os.makedirs(triton_cache, exist_ok=True)
os.environ["TRITON_CACHE_DIR"] = triton_cache
Redirect the Inductor/Triton cache to the vLLM cache location.

In the future, I plan to dump more information in the cache directory, including:

the goal is to make
Agree with Tyler's comments about config caching, otherwise LGTM!
Signed-off-by: youkaichao <[email protected]>
factors.append(self.inductor_compile_config)
factors.append(self.inductor_passes)
What about candidate_compile_sizes?
They do not affect the computation graph. Say we compile for candidate_compile_sizes = [1, 2, 4], and then run again with candidate_compile_sizes = [1, 2, 4, 8]; we want both runs to share the same cache directory, so that we can directly load the cache for [1, 2, 4] and only compile for the new shape 8.
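The reuse pattern described above, in sketch form (the cache object and compile function are stand-ins, not vLLM APIs):

def compile_candidate_sizes_sketch(candidate_compile_sizes, hash_cache,
                                   compile_for_size):
    # the sizes are not part of the cache-key hash, so a later run with a
    # superset of sizes reuses every previously compiled entry and only
    # compiles the new ones
    for size in candidate_compile_sizes:
        if size in hash_cache:                      # e.g. [1, 2, 4] hit on the second run
            continue
        hash_cache[size] = compile_for_size(size)   # only the new shape (e.g. 8) compiles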
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: youkaichao <[email protected]>
Works for me now! Some suggestions on comments and logging, but LGTM otherwise
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
directly bypass aot-autograd and inductor, and load from cache