[V1][Core] Autotune encoder cache budget #11895
The comment seems a bit confusing to me. I tried to rephrase it based on my understanding, but please help clarify:
num_items == 1:
num_items > 1:
# A batch can cover all (except the last one) multimodal items.
Meanwhile, I don't fully understand what you meant by "cached" and "chunked prefill", though. I suppose they are orthogonal to the number of items?
I will clarify this. During profiling we always take the worst case (i.e. requests will all have the biggest possible multimodal item), so what I meant by "cached" and "chunked prefill" is that each multimodal item will always be needed in two engine steps, since a single batch cannot cover the entirety of it.
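As a rough illustration of "needed in two engine steps" (purely hypothetical numbers, not code from this PR): when a single multimodal item is larger than one batch of `max_num_batched_tokens`, its token range necessarily crosses a step boundary, so its encoder output has to survive from one engine step to the next.

```python
# Hypothetical numbers for illustration only -- not code from this PR.
max_num_batched_tokens = 8192
item_start, item_length = 5000, 9000          # a single item larger than one batch
item_end = item_start + item_length           # 14000

# Engine step i covers token positions [i * 8192, (i + 1) * 8192).
first_step = item_start // max_num_batched_tokens       # 0
last_step = (item_end - 1) // max_num_batched_tokens    # 1

# The item cannot fit inside one step, so its embeddings are needed in two
# consecutive steps and must stay in the encoder cache in between.
assert last_step - first_step >= 1
```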
That makes sense. Thanks!
Clarified via 2a4b1d5
I think this is only applicable to the `else` block?
This is applicable to all cases, and in fact it was in the `if` block that I discovered this issue, which wasn't addressed prior to this PR. Here's a concrete example:

Suppose `max_num_batched_tokens=8192`, and two identical requests each have length 16032 after processing, with their image spanning `start_index=7333` to `end_index=16020` (thus `length=8687`). Also suppose `encoder_cache_budget=8687`, for the sake of showing how the issue happens when we don't add budget for one additional item.

Time 0: Request 0 gets scheduled for 8192 tokens. Since `start_index=7333 < 8192 < end_index=16020` and the cache is empty, image 0 gets processed and the resulting embeddings are cached, so all of the space budget is used up.

Time 1: Request 0 gets scheduled for its remaining `16032 - 8192 = 7840` tokens. An important note here is that scheduling is synchronous, so we treat these tokens as already computed once they are scheduled. Since `16020 > 16016`, image 1 is needed here, but the space budget is already used up because image 0 is still in the cache. The scheduler code referenced here (vllm/vllm/v1/core/scheduler.py, lines 380 to 387 in 9a22834) will then set `num_new_tokens` to `7333 (start_pos) - 16016 (num_computed_tokens) = -8683` and crash the server, as we cannot have non-positive `num_new_tokens`.
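To make the arithmetic easier to follow, here is a small numeric sketch of the walkthrough above. The variable names are illustrative rather than the scheduler's internals, and reading the 16016 figure as a prefix-cache hit on the identical second request is an assumption, not something stated explicitly in this thread.

```python
# Numbers from the example above; names are illustrative, not scheduler internals.
max_num_batched_tokens = 8192
request_len = 16032
start_index, end_index = 7333, 16020
item_len = end_index - start_index               # 8687
encoder_cache_budget = item_len                  # room for exactly one item

# Time 0: request 0 is scheduled for 8192 tokens; start_index < 8192 < end_index,
# so image 0 is encoded and cached, consuming the whole budget.
budget_left = encoder_cache_budget - item_len    # 0

# Time 1: request 0 is scheduled for its remaining tokens.
remaining_tokens = request_len - max_num_batched_tokens     # 7840

# Assumption: request 1 is identical, so it is treated as having 16016 tokens
# already computed (e.g. a prefix-cache hit covering everything but its last
# tokens). Its tail still overlaps the image range, since end_index (16020) is
# greater than 16016, so image 1 is needed -- but there is no budget left.
num_computed_tokens = 16016
assert end_index > num_computed_tokens and budget_left == 0

# The scheduler then falls back to scheduling only up to the image's start
# position, which goes negative here and crashes:
num_new_tokens = start_index - num_computed_tokens           # 7333 - 16016 = -8683
assert num_new_tokens < 0
```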
Both cases would need this.

Also, for this comment:

> This cannot address the issue fundamentally, because we also need to guarantee the item is always available in the encoder cache when we schedule the request. For example, an item used by request A and request B. Request A has finished so prefix and mm items are cached. However, due to encoder cache budget, one item in request A is evicted before request B comes. This would result in the same problem.

I guess this can somehow be avoided if we could guarantee that all prefix-cached mm items are always available in the encoder cache as well, but fundamentally this has to be solved by supporting `num_tokens=0` in the model runner.
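A tiny sketch of the request A / request B scenario (toy data structures, purely illustrative):

```python
# Toy illustration only; the real prefix cache and encoder cache are not dicts.
prefix_cache = {}     # KV blocks kept around after a request finishes
encoder_cache = {}    # multimodal embeddings, bounded by the budget

# Request A runs to completion: its blocks and its mm item are cached.
prefix_cache["shared_prefix"] = "blocks"
encoder_cache["shared_item"] = "embeddings"

# Encoder-cache budget pressure evicts the item while the prefix blocks remain.
encoder_cache.pop("shared_item")

# Request B arrives, gets a prefix-cache hit past the item's start position,
# but the item it still needs is gone -- the same failure mode as in the
# concrete example earlier in this thread.
assert "shared_prefix" in prefix_cache
assert "shared_item" not in encoder_cache
```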
That's a good callout! I've adjusted the comment accordingly.
Note this is just a reordering for better readability.