Tabby server unable to scale with increasing connections #889
Comments
Scaling out Tabby is a compelling topic that we plan to cover in a future blog series. For now, considering your requirements (1. an 8xA100 80G server, 2. serving a team of approximately 100 people), I recommend the following setup:
This configuration should provide a reasonable balance between performance, quality, and latency. If you are in our Slack channel, feel free to DM me (Meng Zhang) to discuss any issues you encounter. We would be happy to learn more about your use case and provide assistance.
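As a hedged illustration of the general scale-out idea (one Tabby replica per GPU with requests spread across them), a client-side round-robin sketch might look like the following. The replica ports and the `/v1/completions` payload shape are assumptions based on the v0.5.x API and are not part of the original recommendation:

```python
import itertools
import requests

# Hypothetical replica endpoints: one Tabby instance pinned to each GPU,
# e.g. started with CUDA_VISIBLE_DEVICES=0..7 on ports 8080..8087.
REPLICAS = [f"http://127.0.0.1:{8080 + i}" for i in range(8)]
_round_robin = itertools.cycle(REPLICAS)

def complete(prefix: str, suffix: str = "", language: str = "python") -> dict:
    """Send a completion request to the next replica in round-robin order."""
    base = next(_round_robin)
    resp = requests.post(
        f"{base}/v1/completions",
        json={"language": language, "segments": {"prefix": prefix, "suffix": suffix}},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(complete("def fibonacci(n):\n    "))
```

In practice a reverse proxy such as nginx or HAProxy in front of the replicas is the more usual way to achieve the same distribution, keeping the editor plugins pointed at a single endpoint.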
Is there a way to separate out the Tabby server and replace it with another inference server that already takes care of these things, like 🤗 text-generation-inference or a vllm server?
@simon376 I have managed to run deepseek-coder-6.7B-AWQ via huggingface tgi. That was straightforward. It's a bit unrelated to this issue; feel free to DM me and I can help you (Discord link in my bio).
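For anyone exploring that route, a minimal sketch of querying a locally running text-generation-inference instance could look like the following. The host, port, and model choice are assumptions, not details from the comment above:

```python
import requests

# Assumed local TGI endpoint, e.g. started from the official docker image
# with a deepseek-coder model; adjust the URL for your deployment.
TGI_URL = "http://127.0.0.1:8080/generate"

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Request a completion from TGI's /generate endpoint."""
    resp = requests.post(
        TGI_URL,
        json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

if __name__ == "__main__":
    print(generate("def quicksort(arr):\n    "))
```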
Load test and reference data points are implemented in #906.
Describe the bug
I started a local Tabby server on a GPU (A100 80G) with a 13B model in the file system. Completions work fine for a single client, but as I send more concurrent requests, response times increase linearly with the number of connections.
Information about your version
I am running v0.5.5. As an aside, when I run `tabby` commands, I get this error:

I start my tabby server with:
Information about your GPU
Please provide the output of `nvidia-smi`:
Additional context
I simulated concurrent requests with Locust and found that requests hit the 30s inference timeout at about 25-30 concurrent requests per second. Please find the test report attached for your reference.
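For reference, a minimal Locust script of the kind used for such a test might look like the sketch below. The `/v1/completions` payload shape and the prompt are assumptions based on the v0.5.x API, not the exact script used here:

```python
from locust import HttpUser, task, between

class TabbyCompletionUser(HttpUser):
    # Run against the Tabby server, e.g.:
    #   locust -f locustfile.py --host http://127.0.0.1:8080
    wait_time = between(0.5, 2.0)

    @task
    def completion(self):
        # Assumed payload shape for Tabby's /v1/completions endpoint.
        self.client.post(
            "/v1/completions",
            json={
                "language": "python",
                "segments": {"prefix": "def binary_search(arr, target):\n    ", "suffix": ""},
            },
            timeout=30,
        )
```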
What is the expected scale of Tabby? I expected it to be high given that the server is built on Hyper.
Please let me know if I am missing any configuration.
I want to scale this Tabby server instance to at least 100 users in parallel.