Tabby server unable to scale with increasing connections #889

Closed
sundaraa-deshaw opened this issue Nov 24, 2023 · 4 comments

Labels
documentation (Improvements or additions to documentation)

@sundaraa-deshaw

Describe the bug
I started a local Tabby server on a GPU (A100 80G) with a 13B model loaded from the file system. Completions work fine, but as I send more concurrent requests, response times increase linearly with the number of connections.

Information about your version
I am running v0.5.5. As an aside, when I run tabby commands, I get this error:

tabby: error while loading shared libraries: libllama.so: cannot open shared object file: No such file or directory

I start my tabby server with:

PATH="/usr/local/cuda/bin:$PATH" RUST_LOG=debug CUDA_VISIBLE_DEVICES="....." PROTOC=/..../bin/protoc cargo run serve --device cuda --model /path/to/model/dir

Information about your GPU
Output of nvidia-smi:

nvidia-smi
Fri Nov 24 02:36:44 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:4F:00.0 Off |                    0 |
| N/A   33C    P0              69W / 300W |  26878MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000000:52:00.0 Off |                    0 |
| N/A   35C    P0              70W / 300W |   8649MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe          On  | 00000000:53:00.0 Off |                    0 |
| N/A   33C    P0              67W / 300W |   2731MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe          On  | 00000000:56:00.0 Off |                    0 |
| N/A   33C    P0              69W / 300W |   2731MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100 80GB PCIe          On  | 00000000:57:00.0 Off |                    0 |
| N/A   35C    P0              76W / 300W |   2731MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100 80GB PCIe          On  | 00000000:CE:00.0 Off |                    0 |
| N/A   34C    P0              70W / 300W |   4143MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100 80GB PCIe          On  | 00000000:D1:00.0 Off |                    0 |
| N/A   35C    P0              71W / 300W |   1879MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100 80GB PCIe          On  | 00000000:D2:00.0 Off |                    0 |
| N/A   34C    P0              71W / 300W |   1873MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   8  NVIDIA A100 80GB PCIe          On  | 00000000:D5:00.0 Off |                    0 |
| N/A   36C    P0              67W / 300W |   1875MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   9  NVIDIA A100 80GB PCIe          On  | 00000000:D6:00.0 Off |                    0 |
| N/A   37C    P0              72W / 300W |   9775MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

Additional context
I simulated concurrent requests with Locust and see that requests start hitting the 30s inference timeout at roughly 25-30 concurrent requests per second. The test report is attached for reference.

[Locust test report screenshot]

What is the expected scale of Tabby? I expected it to be high, given that this is running on the Hyper server.
Please let me know if I am missing any configuration.
I want to scale this Tabby server instance to at least 100 parallel users.
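
For reference, each simulated request in the Locust test boils down to a single call to Tabby's /v1/completions endpoint, roughly like the sketch below (the port, language, and prefix shown are illustrative):

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"language": "python", "segments": {"prefix": "def fib(n):\n    "}}'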

@wsxiaoys (Member) commented Nov 24, 2023

Scaling out Tabby is a compelling topic that we plan to cover in a future blog series. For now, considering your requirements - 1. an 8x A100 80G server, 2. serving a team of approximately 100 people - I recommend the following setup:

  1. Utilize a model with approximately 7 billion parameters from the registry (DeepSeekCoder-6.7B currently leads our leaderboard: https://tabby.tabbyml.com/docs/models/). Tune the --parallelism argument (added in 0.6.0) to make sure it fits your VRAM (likely something between 8 and 12).

  2. Deploy 8 Tabby Docker containers, each running on an individual A100 GPU. You might consider using docker-compose for orchestration: https://tabby.tabbyml.com/docs/installation/docker-compose/

  3. Set up a load balancer (e.g. Caddy) in front of these 8 Tabby Docker containers (a docker-compose and Caddyfile sketch follows below).

This configuration should provide a reasonable balance between performance, quality, and latency.
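
To make steps 2 and 3 concrete, here is a rough sketch rather than an official configuration; the model name, --parallelism value, and volume paths are placeholders to adjust, and Caddy reaches the Tabby containers by service name on the compose network:

# docker-compose.yml (sketch) -- one Tabby container pinned to one GPU;
# repeat the service block for tabby-1 ... tabby-7, changing only the name,
# the data volume, and the device_ids entry.
services:
  tabby-0:
    image: tabbyml/tabby
    command: serve --model DeepseekCoder-6.7B --device cuda --parallelism 8
    volumes:
      - "$HOME/.tabby-0:/data"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]

# Caddyfile (sketch) -- Caddy is the only service exposed to clients; it
# spreads completion requests across the 8 Tabby containers.
:8080 {
    reverse_proxy tabby-0:8080 tabby-1:8080 tabby-2:8080 tabby-3:8080 tabby-4:8080 tabby-5:8080 tabby-6:8080 tabby-7:8080 {
        lb_policy least_conn
    }
}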

If you are in our Slack channel, feel free to DM me (Meng Zhang) to discuss any issues you encounter. We would be happy to learn more about your use case and provide assistance.

wsxiaoys added the documentation label and removed the bug-unconfirmed label on Nov 24, 2023
@simon376

Is there a way to separate out the Tabby server and replace it with another inference server that already takes care of these things, like 🤗 text-generation-inference or a vLLM server?

@erfanium (Contributor) commented Dec 4, 2023

@simon376 I have managed to run deepseek-coder-6.7B-AWQ via Hugging Face TGI; that was straightforward.
In my case, I completely replaced the Tabby server and only use Tabby's VS Code extension.

It's a bit unrelated to this issue, but feel free to DM me and I can help you (Discord link is in my bio).
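
Roughly, such a TGI launch looks like the following (a sketch only; the image tag and the exact AWQ checkpoint name are illustrative and should be checked on the Hugging Face hub):

docker run --gpus all -p 8080:80 \
  -v $PWD/tgi-data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id TheBloke/deepseek-coder-6.7B-base-AWQ \
  --quantize awq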

@wsxiaoys (Member) commented Dec 5, 2023

A load test and reference datapoints are implemented in #906.

wsxiaoys closed this as completed on Dec 5, 2023