[V1] Prefix caching (take 2) #9972
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Force-pushed from c7e35a5 to e6bd231
@comaniac Thanks for the great work! Is this PR ready for review?
Yes, I don't have anything more to add. Please go ahead and review.
Will review the PR tonight.
Thanks @comaniac for implementing this! I spent some time understanding the code, and in general it LGTM.
A high-level question: right now we have 3 ways to index a block: the block hash, the block ID, and the Python `KVCacheBlock` object itself. Do we have to have all 3? I assume we cannot use only the block hash, since not all blocks have a hash. Can we remove the block ID and index only by the `KVCacheBlock` object? Is the reason we keep `block_id` that we need to pass `block_id`s to the worker?
So the question is whether we could remove the `block_id` attribute from the data class? I'm not sure, but I will take a look.
Force-pushed from e6bd231 to 2204969
@zhuohan123 I've indexed blocks using the object itself instead of block IDs. This does simplify the code in many places, so thanks for the suggestion. Meanwhile, I still need to keep the block ID in the object so that the scheduler can build block tables. As for enabling prefix caching by default, I guess it might be better to do that a bit later, once VLM support is in. If there are no other objections, I plan to merge this PR today.
WDYT about enabling it by default for non-VLMs? It might be nice to have it exercised, since we want it to be the default soon anyhow.
I'm OK with this proposal. WDYT @WoosukKwon?
The changes in the latest commit:
Force-pushed from dc8a966 to 9c56442
Thanks for the great work! This is super awesome! I never imagined prefix caching could be implemented so cleanly and efficiently 😮
Please rebase & reformat before merge.
This PR adds prefix caching to V1 (take 2). Take 1 is in #9668.
The main difference in take 2 is that we adopt a custom doubly linked list to manage free blocks with eviction. Compared to the Python builtin `deque`, this doubly linked list supports removing an arbitrary block via `.remove()` in O(1) time.
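As a rough illustration of why a custom doubly linked list helps here, below is a minimal sketch of a free-block queue with O(1) `popleft()` and O(1) `remove()`. This is not the actual vLLM code; the class names `Block` and `FreeBlockQueue` and their attributes are made up for the example.

```python
from typing import Optional


class Block:
    """Minimal free-list node; the real KVCacheBlock carries more metadata."""

    def __init__(self, block_id: int):
        self.block_id = block_id
        self.prev: Optional["Block"] = None
        self.next: Optional["Block"] = None


class FreeBlockQueue:
    """Doubly linked list of free blocks.

    Unlike collections.deque, removing an arbitrary block (e.g. when a
    cached free block gets reused on a prefix-cache hit) is O(1), because
    we can unlink it directly through its prev/next pointers.
    """

    def __init__(self) -> None:
        self.head: Optional[Block] = None  # Next eviction candidate.
        self.tail: Optional[Block] = None  # Most recently freed block.
        self.num_free = 0

    def append(self, block: Block) -> None:
        """Push a freed block to the tail (end of the eviction order)."""
        block.prev, block.next = self.tail, None
        if self.tail is not None:
            self.tail.next = block
        else:
            self.head = block
        self.tail = block
        self.num_free += 1

    def popleft(self) -> Block:
        """Pop the next eviction candidate from the head."""
        assert self.head is not None, "No free blocks to evict"
        block = self.head
        self.remove(block)
        return block

    def remove(self, block: Block) -> None:
        """Unlink an arbitrary block from the queue in O(1)."""
        if block.prev is not None:
            block.prev.next = block.next
        else:
            self.head = block.next
        if block.next is not None:
            block.next.prev = block.prev
        else:
            self.tail = block.prev
        block.prev = block.next = None
        self.num_free -= 1
```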
Benchmarks
Offline Batching
Online Serving
Server
Client
Data Structure
The same as Take 1.
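For context on the three ways to refer to a block discussed in the review above (block hash, block ID, and the `KVCacheBlock` object), here is a hedged sketch of the kind of per-block metadata involved. Field names are illustrative, not the exact V1 definitions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class KVCacheBlockSketch:
    """Illustrative per-block metadata (not the exact vLLM V1 class)."""

    # Physical block ID, still kept so the scheduler can build block
    # tables for the workers.
    block_id: int
    # Number of requests currently referencing this block.
    ref_cnt: int = 0
    # Hash of the block's tokens once the block is full; None otherwise.
    block_hash: Optional[int] = None
    # Token IDs stored so far, used to construct the hash when full.
    token_ids: List[int] = field(default_factory=list)
    # Doubly-linked-list pointers for the free block queue.
    prev_free_block: Optional["KVCacheBlockSketch"] = None
    next_free_block: Optional["KVCacheBlockSketch"] = None
```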
Algorithms
Almost the same as Take 1, except that we no longer use lazy removal, because we now support removal in O(1) time.
Allocate Slots
When a request is scheduled for the first time, `allocate_slots()` is used to allocate blocks based on the currently scheduled prompt tokens. If the prompt is chunked due to chunked prefill, we only allocate blocks for the scheduled tokens. In addition to the scheduled tokens, we also pre-allocate empty blocks to reduce allocation overheads.

With prefix caching, when we attempt to allocate a full block, we compute its block hash and query the cached block map. There are 3 possible outcomes:
Note: On a cache miss, when we allocate a new block, the token IDs are added to the allocated block to construct its hash. The block is also added to the cache if it is full.
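As a hedged illustration of the allocation-time lookup described above, the sketch below handles one full block. It assumes block objects with `ref_cnt`, `block_hash`, and `token_ids` attributes, a `cached_block_map` dict keyed by block hash, and a free queue with `remove()`/`popleft()`; none of these names are the actual V1 API, and the hash here is a stand-in for the real block hash.

```python
def allocate_full_block(token_ids, cached_block_map, free_queue):
    """Sketch: allocate one full block, reusing a cached block on a hash hit."""
    block_hash = hash(tuple(token_ids))  # Placeholder for the real block hash.

    cached = cached_block_map.get(block_hash)
    if cached is not None:
        # Cache hit: reuse the cached block. If it currently sits in the
        # free queue (ref_cnt == 0), remove it so it is no longer an
        # eviction candidate.
        if cached.ref_cnt == 0:
            free_queue.remove(cached)
        cached.ref_cnt += 1
        return cached

    # Cache miss: evict the head of the free queue and repurpose it.
    block = free_queue.popleft()
    if block.block_hash is not None:
        # The evicted block may still be registered under its old hash.
        cached_block_map.pop(block.block_hash, None)
    block.token_ids = list(token_ids)
    block.block_hash = block_hash  # Full block, so it can be cached.
    cached_block_map[block_hash] = block
    block.ref_cnt = 1
    return block
```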
Append Slots
When a request is scheduled again, `append_slots()` is used to allocate more blocks if needed. This can be the case for continued chunked prefill or for decode. Here are the steps in append slots:
Free

When a request is done, each of its blocks decreases its reference count by 1. If a block now has 0 references, it is freed (pushed to the free block queue). Note that since we allocate new blocks by popping from the free block queue, the block order in the free block queue is also the eviction order. Since we now use an LRU eviction policy, the eviction order is:
We maintain the above order by pushing free blocks to the queue in reverse order, so that:
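To illustrate the free path, here is a minimal sketch under the same assumptions as the allocation sketch above. Pushing a finished request's blocks in reverse order means its tail blocks become eviction candidates before its prefix blocks, presumably because prefix blocks are more likely to be reused.

```python
def free_request_blocks(blocks, free_queue):
    """Sketch: release a finished request's blocks.

    blocks: the request's blocks in block-table order (prefix first).
    free_queue: a queue with append(); its order is the eviction order.
    """
    for block in reversed(blocks):
        block.ref_cnt -= 1
        if block.ref_cnt == 0:
            # A hashed block stays in the cached block map until it is
            # actually evicted, so it can still be reused while free.
            free_queue.append(block)
```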
Get Computed Blocks
Before calling `allocate_slots()`, the scheduler calls `get_computed_block_ids()` to know how many blocks hit the cache. This function simply computes the hashes of full blocks and queries the cache for existing block IDs. It won't allocate any block or change any block metadata.
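A hedged sketch of this lookup; the early stop at the first miss and the simple token-window hash are assumptions for illustration, not the exact V1 behavior.

```python
def get_computed_block_ids_sketch(prompt_token_ids, block_size, cached_block_map):
    """Sketch: find how many leading full blocks of the prompt hit the cache.

    Only reads the cache; does not allocate blocks or touch metadata.
    """
    computed_block_ids = []
    num_full_blocks = len(prompt_token_ids) // block_size
    for i in range(num_full_blocks):
        block_tokens = tuple(prompt_token_ids[i * block_size:(i + 1) * block_size])
        block_hash = hash(block_tokens)  # Placeholder for the real block hash.
        cached = cached_block_map.get(block_hash)
        if cached is None:
            # Stop at the first miss: later blocks cannot be prefix hits.
            break
        computed_block_ids.append(cached.block_id)
    return computed_block_ids
```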
Duplication
Since V1 prepares inputs incrementally, the block table is append-only. This results in potential duplication, as shown below. Suppose we have 2 identical requests (same prompt with greedy sampling) arriving at different times:
Time 1
Time 2
Time 3
Time 4
At time 4, block 4 becomes full and has the same hash and content as block 3. In the vLLM V0 block manager, we would free block 4 and assign block 3 to req2 in the next step. However, we cannot do this in V1 because the block table is append-only. As a result, at this moment the cache will look like:
We consider this to be fine for practical use cases, because:
cc @WoosukKwon @zhuohan123 @njhill