There is a discrepancy between resource management on servers and on edge devices.
Our API is edge-friendly. It's relatively high-level. The lowest-level ops we expose so far are `get_token` and `push_prompt`; `complete_text` and `generate_image` are even higher level.
These high-level ops are internally composed of multiple tasks. For example, `get_token` decodes the pending prompt and then samples a token on the CPU (to be decoded on the next iteration).
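To make the composition concrete, here is a minimal sketch of what such an op does internally. Only `get_token` comes from our API; `decode_on_gpu`, `sample_on_cpu`, and `Session` are illustrative names invented for the sketch, not our actual internals:

```cpp
#include <vector>

// Illustrative types and stubs; all names except get_token are assumptions
// made for this sketch, not part of the actual API or internals.
using Token  = int;
using Logits = std::vector<float>;

struct Session {
    std::vector<Token> pending_prompt;
};

Logits decode_on_gpu(const std::vector<Token>&) { return Logits(32000, 0.f); } // GPU task (stub)
Token  sample_on_cpu(const Logits&) { return 0; }                              // CPU task (stub)

// A high-level op like get_token bundles both tasks into a single call:
Token get_token(Session& s) {
    Logits logits = decode_on_gpu(s.pending_prompt); // GPU is busy here
    Token t = sample_on_cpu(logits);                 // GPU is idle while this runs
    s.pending_prompt = {t};                          // will be decoded on the next call
    return t;
}
```

From the caller's point of view there is only the single call; the CPU/GPU split inside it is invisible.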
On an edge device, which only runs a single inference instance, this is fine and indeed inevitable, since the sampled token is needed for the next decode. On a server, however, we have several issues:
- The server potentially has multiple pending independent jobs.
- Sampling time is not negligible, and if it runs on the CPU, the GPU is idle during that time. From the point of view of an op, we have no way of knowing when the GPU is idle.
- The inference/decode itself may not utilize the GPU completely; its shape may leave part of the GPU idle.
If one is writing a server for a dedicated model, they would have finer-grained control over the resources, because they would have access to a genuinely low-level API that clearly separates CPU and GPU tasks. Fine-grained resource management would be possible.
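For comparison, here is a minimal sketch of what such a finer-grained server loop could look like, assuming a hypothetical split API where decode (GPU) and sample (CPU) are separate calls. None of these names exist in our API; they are assumptions for the sketch:

```cpp
#include <future>
#include <vector>

// Hypothetical split API: decode (GPU) and sample (CPU) are separate calls,
// so a server can overlap one job's sampling with another job's decode.
// All names and stubs below are assumptions for this sketch.
using Token  = int;
using Logits = std::vector<float>;

struct Job {
    std::vector<Token> pending;
};

Logits decode_on_gpu(const std::vector<Token>&) { return Logits(32000, 0.f); } // stub
Token  sample_on_cpu(const Logits&) { return 0; }                              // stub

// One scheduling step over a batch of independent jobs.
void step(std::vector<Job>& jobs) {
    std::future<Token> pending_sample; // CPU sampling of the previous job
    Job* prev = nullptr;
    for (Job& job : jobs) {
        // While the GPU decodes this job, the previous job's sampling
        // (launched at the end of the last iteration) runs on a CPU thread.
        Logits logits = decode_on_gpu(job.pending);
        if (prev) prev->pending = {pending_sample.get()};
        pending_sample = std::async(std::launch::async, sample_on_cpu, std::move(logits));
        prev = &job;
    }
    if (prev) prev->pending = {pending_sample.get()};
}
```

With our current `get_token`, this kind of interleaving is impossible, because the decode and the sample are fused inside one op.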
So... the fear is that with our design we will never be able to compete with a dedicated server in terms of utilization. We will always be at least several percent slower than a dedicated solution.
I don't know whether we can do something about it. The only approach I can see would be to require all plugins to use a shared compute library (like ggml) and then manage some state from there.
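To illustrate what "manage some state from there" could mean, here is a very rough sketch of a host-side scheduler that plugins would submit their CPU and GPU tasks to instead of running them inline. Everything here (`HostScheduler`, `ComputeTask`, `Backend`) is hypothetical; nothing like it exists today:

```cpp
#include <functional>
#include <future>
#include <queue>
#include <utility>

// Hypothetical host-side scheduler. If all plugins built their compute through
// a shared layer, they could submit tasks here instead of running them inline,
// and the host could keep the CPU and GPU busy across independent jobs.
enum class Backend { Cpu, Gpu };

struct ComputeTask {
    Backend backend;
    std::function<void()> run;
};

class HostScheduler {
public:
    void submit(ComputeTask t) {
        queues_[t.backend == Backend::Gpu].push(std::move(t));
    }

    // Trivial policy for the sketch: run a queued CPU task on a helper thread
    // while the next GPU task executes, instead of serializing them.
    void drain() {
        while (!queues_[0].empty() || !queues_[1].empty()) {
            std::future<void> cpu_work;
            if (!queues_[0].empty()) {
                ComputeTask t = std::move(queues_[0].front());
                queues_[0].pop();
                cpu_work = std::async(std::launch::async, std::move(t.run));
            }
            if (!queues_[1].empty()) {
                ComputeTask t = std::move(queues_[1].front());
                queues_[1].pop();
                t.run(); // GPU task; overlaps with the CPU task above
            }
            if (cpu_work.valid()) cpu_work.get();
        }
    }

private:
    std::queue<ComputeTask> queues_[2]; // [0] = CPU, [1] = GPU
};
```

The cost, of course, is forcing a specific compute library on plugin authors, which is exactly the trade-off above.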