Tabby server unable to scale with increasing connections #889

Closed
sundaraa-deshaw opened this issue Nov 24, 2023 · 4 comments

Labels
documentation (Improvements or additions to documentation)

@sundaraa-deshaw

Describe the bug
I started a local Tabby server on a GPU (A100 80G) with a 13B model loaded from the file system. Completions work fine, but as I send more concurrent requests, response times increase linearly with the number of connections.

Information about your version
I am running v0.5.5. As an aside, when I run tabby commands, I get this error:

tabby: error while loading shared libraries: libllama.so: cannot open shared object file: No such file or directory

I start my tabby server with:

PATH="/usr/local/cuda/bin:$PATH" RUST_LOG=debug CUDA_VISIBLE_DEVICES="....." PROTOC=/..../bin/protoc cargo run serve --device cuda --model /path/to/model/dir

Information about your GPU
Output of nvidia-smi:

nvidia-smi
Fri Nov 24 02:36:44 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:4F:00.0 Off |                    0 |
| N/A   33C    P0              69W / 300W |  26878MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000000:52:00.0 Off |                    0 |
| N/A   35C    P0              70W / 300W |   8649MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe          On  | 00000000:53:00.0 Off |                    0 |
| N/A   33C    P0              67W / 300W |   2731MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe          On  | 00000000:56:00.0 Off |                    0 |
| N/A   33C    P0              69W / 300W |   2731MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100 80GB PCIe          On  | 00000000:57:00.0 Off |                    0 |
| N/A   35C    P0              76W / 300W |   2731MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100 80GB PCIe          On  | 00000000:CE:00.0 Off |                    0 |
| N/A   34C    P0              70W / 300W |   4143MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100 80GB PCIe          On  | 00000000:D1:00.0 Off |                    0 |
| N/A   35C    P0              71W / 300W |   1879MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100 80GB PCIe          On  | 00000000:D2:00.0 Off |                    0 |
| N/A   34C    P0              71W / 300W |   1873MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   8  NVIDIA A100 80GB PCIe          On  | 00000000:D5:00.0 Off |                    0 |
| N/A   36C    P0              67W / 300W |   1875MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   9  NVIDIA A100 80GB PCIe          On  | 00000000:D6:00.0 Off |                    0 |
| N/A   37C    P0              72W / 300W |   9775MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

Additional context
I simulated concurrent requests with Locust and see that requests start hitting the 30s inference timeout at roughly 25-30 concurrent requests per second. The test report is attached for reference.

[Locust test report screenshot]

What is the expected scale of Tabby? I expected it to be high, given that this is running on the Hyper server.
Please let me know if I am missing any configuration.
I want to scale this Tabby server instance to at least 100 parallel users.
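
For reference, each simulated request in the Locust test boils down to a single call to Tabby's /v1/completions endpoint, roughly like the sketch below (the port, language, and prefix shown are illustrative):

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"language": "python", "segments": {"prefix": "def fib(n):\n    "}}'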

@wsxiaoys (Member) commented Nov 24, 2023

Scaling out Tabby is a compelling topic that we plan to cover in a future blog series. For now, considering your requirements - 1. an 8x A100 80G server, 2. serving a team of approximately 100 people - I recommend the following setup:

  1. Utilize a model with approximately 7 billion parameters from the registry (DeepSeekCoder-6.7B currently leads our leaderboard: https://tabby.tabbyml.com/docs/models/). Tune the --parallelism argument (added in 0.6.0) to make sure it fits your VRAM (likely something between 8 and 12).

  2. Deploy 8 Tabby Docker containers, each running on an individual A100 GPU. You might consider using docker-compose for orchestration: https://tabby.tabbyml.com/docs/installation/docker-compose/

  3. Set up a load balancer (e.g. Caddy) in front of these 8 Tabby Docker containers (a docker-compose and Caddyfile sketch follows below).

This configuration should provide a reasonable balance between performance, quality, and latency.
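
To make steps 2 and 3 concrete, here is a rough sketch rather than an official configuration; the model name, --parallelism value, and volume paths are placeholders to adjust, and Caddy reaches the Tabby containers by service name on the compose network:

# docker-compose.yml (sketch) -- one Tabby container pinned to one GPU;
# repeat the service block for tabby-1 ... tabby-7, changing only the name,
# the data volume, and the device_ids entry.
services:
  tabby-0:
    image: tabbyml/tabby
    command: serve --model DeepseekCoder-6.7B --device cuda --parallelism 8
    volumes:
      - "$HOME/.tabby-0:/data"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]

# Caddyfile (sketch) -- Caddy is the only service exposed to clients; it
# spreads completion requests across the 8 Tabby containers.
:8080 {
    reverse_proxy tabby-0:8080 tabby-1:8080 tabby-2:8080 tabby-3:8080 tabby-4:8080 tabby-5:8080 tabby-6:8080 tabby-7:8080 {
        lb_policy least_conn
    }
}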

If you are in our Slack channel, feel free to DM me (Meng Zhang) to discuss any issues you encounter. We would be happy to learn more about your use case and provide assistance.

wsxiaoys added the documentation label and removed the bug-unconfirmed label on Nov 24, 2023
@simon376

Is there a way to separate out the Tabby server and replace it with another inference server that already takes care of these things, like 🤗 text-generation-inference or a vLLM server?

@erfanium (Contributor) commented Dec 4, 2023

@simon376 I have managed to run deepseek-coder-6.7B-AWQ via Hugging Face TGI; that was straightforward.
In my case, I completely replaced the Tabby server and only use Tabby's VS Code extension.

It's a bit unrelated to this issue, but feel free to DM me and I can help you (Discord link is in my bio).
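
Roughly, such a TGI launch looks like the following (a sketch only; the image tag and the exact AWQ checkpoint name are illustrative and should be checked on the Hugging Face hub):

docker run --gpus all -p 8080:80 \
  -v $PWD/tgi-data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id TheBloke/deepseek-coder-6.7B-base-AWQ \
  --quantize awq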

@wsxiaoys (Member) commented Dec 5, 2023

A load test and reference datapoints are implemented in #906.

wsxiaoys closed this as completed on Dec 5, 2023