-
-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Prototyping re-arch #9166
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First part of review: scheduler and KV cache manager.
High-level notes on the new scheduler and KV cache manager:
- Assuming chunked prefill always on.
- I still have some confusion on how does the new scheduler works for decoding requests. See comment below.
- No sequence group (which is the goal of the re-arch)
- No CPU swapping (swapping will be in KV cache manager)
- All sharing via prefix caching
- However, still requires a prefix caching enabled KV cache manager.
# Reserve block id 0 for padding. | ||
self.free_block_ids = list(range(num_gpu_blocks)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is block 0 actually reserved?
num_tokens = request.num_tokens - request.num_computed_tokens | ||
num_tokens = min(num_tokens, token_budget) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we assume we always do chunked prefill right? How does the num_tokens
computation here work for decoding phase requests? Will request.num_tokens - request.num_computed_tokens
always be 1 in that case?
if preempted_reqs: | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this should be outside of the while loop.
if request.status == RequestStatus.WAITING: | ||
scheduled_new_reqs.append(request) | ||
elif request.status == RequestStatus.PREEMPTED: | ||
scheduled_resumed_reqs.append(request) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to distinguish these two? Is it for delta update optimization?
finished_req_ids=self.finished_req_ids, | ||
aborted_req_ids=self.aborted_req_ids, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Maybe add a comment to distinguish these two fields with other fields.
finished_req_ids=self.finished_req_ids, | |
aborted_req_ids=self.aborted_req_ids, | |
# These two fields are existing states in the scheduler instead of newly scheduled in this step. | |
finished_req_ids=self.finished_req_ids, | |
aborted_req_ids=self.aborted_req_ids, |
def abort_requests(self, request_ids: Union[str, Iterable[str]]) -> None: | ||
if isinstance(request_ids, str): | ||
request_ids = (request_ids, ) | ||
request_ids = set(request_ids) | ||
|
||
# TODO: Optimize this. | ||
for queue in [self.waiting, self.running]: | ||
aborted_reqs: List[Request] = [] | ||
for request in queue: | ||
if not request_ids: | ||
break | ||
if request.request_id in request_ids: | ||
request.status = RequestStatus.FINISHED_ABORTED | ||
aborted_reqs.append(request) | ||
request_ids.remove(request.request_id) | ||
|
||
for request in aborted_reqs: | ||
queue.remove(request) | ||
self.aborted_req_ids.add(request.request_id) | ||
self._free_request(request) | ||
|
||
def stop_requests(self, request_ids: Union[str, Iterable[str]]) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the differences between stop and abort? Can we merge the two functions?
logger = init_logger(__name__) | ||
|
||
|
||
class KVCacheManager: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the new interface for the block manager? Should we implement prefix caching/hierarchical cache in this class?
No description provided.