Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/evict req on client disconnect streaming case #223

Draft
wants to merge 32 commits into
base: main
Choose a base branch
from

Conversation

bhimrazy
Copy link
Contributor

@bhimrazy bhimrazy commented Aug 26, 2024

Before submitting
  • Was this discussed/agreed via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

How does this PR impact the user?

As a user, I want the server to stop processing requests that disconnect from the client before finishing. This PR focuses on tracking disconnected requests (specifically for non-batched streaming mode) and stops those running tasks, saving resources and freeing up space for handling other requests.

What does this PR do?

Partially fixes #165.

  • Handles client request disconnection in streaming mode (non-batch).

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@bhimrazy bhimrazy marked this pull request as draft August 26, 2024 16:22
Copy link

codecov bot commented Aug 26, 2024

Codecov Report

Attention: Patch coverage is 85.18519% with 4 lines in your changes missing coverage. Please review.

Project coverage is 95%. Comparing base (44e0fe9) to head (49bed55).

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #223   +/-   ##
===================================
- Coverage    95%    95%   -0%     
===================================
  Files        18     18           
  Lines      1173   1185   +12     
===================================
+ Hits       1112   1122   +10     
- Misses       61     63    +2     

@bhimrazy bhimrazy marked this pull request as ready for review August 26, 2024 16:31
Copy link
Collaborator

@aniketmaurya aniketmaurya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @bhimrazy! looking good.

src/litserve/server.py Outdated Show resolved Hide resolved
@aniketmaurya aniketmaurya self-requested a review August 26, 2024 18:35
Copy link
Collaborator

@aniketmaurya aniketmaurya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will run some perf test for streaming before merging this PR!

@aniketmaurya aniketmaurya self-requested a review August 26, 2024 18:39
@bhimrazy bhimrazy marked this pull request as draft August 31, 2024 19:39
@bhimrazy bhimrazy marked this pull request as ready for review September 1, 2024 17:41
@bhimrazy
Copy link
Contributor Author

bhimrazy commented Sep 1, 2024

Hi @aniketmaurya,

I’ve made some modifications to check at specific intervals rather than on each output, hoping this approach might minimize any impact.

However, if this PR is affecting performance, I’m more than willing to close it. We can then explore alternative solutions that might be more effective.

@bhimrazy bhimrazy marked this pull request as draft September 4, 2024 19:35
@lantiga
Copy link
Collaborator

lantiga commented Sep 10, 2024

Hey @bhimrazy thanks for the patience, we're going to benchmark soon!

Comment on lines +283 to +287
check_interval = 50
for index, y_enc in enumerate(y_enc_gen):
if index % check_interval == 0 and request_evicted_status.get(uid):
request_evicted_status.pop(uid)
break
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking the request_evicted_status for each token appears to have a significant impact, reducing performance from 3600 to around 3100. However, it may not be necessary to perform this check on every token.

While adding a check interval helps reduce the overhead and brings the performance closer to that of the main branch, but it still doesn't feel like an ideal solution.

Copy link
Collaborator

@aniketmaurya aniketmaurya Sep 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your patience with the PR and checking the speed issue @bhimrazy 🙌 .

yeah, and in case when the time-to-first-token is large but rest of the token stream speed is fast, it doesn't help much.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just single worker. with multiple workers it might impact even more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the overall design is correct, we are just way too aggressive checking the distributed dict and we get into contention problems.

One alternative that could reduce contention is getting a snapshot of the disconnected dictionary in every worker loop: so not use a managed dict but a shared value that the server publishes and that gets read as a whole by each worker periodically (every N seconds - we don't need a thread, we just check the time at every loop). This way every worker has a semi-up to date local dictionary that it can check as often as we want.

Having semi-up to date info on who disconnected every N seconds is totally fine, we don't need to react immediately.

This design also helps with ignoring items in the queue that come from clients that have been disconnected. For those we necessarily have to check at every request. If the local dictionary is not up to date we'll run some requests for nothing, but that's ok. One caveat is making sure the responses don't accumulate in the response dictionary on the webserver process, in this case (let's remember about this).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @lantiga, for the valuable insights. This approach seems promising. I'll take some time to study the concept and work on the implementation shortly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Evict requests if the client has disconnected
3 participants