There is a discrepancy between resource management on servers and on edge devices.
Our API is edge-friendly. It's relatively high-level. The lowest-level ops we expose so far are `get_token` and `push_prompt`; `complete_text` and `generate_image` are even higher level.
These high-level ops are internally composed of multiple tasks. For example, `get_token` decodes the pending prompt and then samples a token on the CPU (to be decoded on the next iteration).
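To make the composition concrete, here is a minimal sketch of what such an op does internally. Only `get_token` comes from our API; `decode_on_gpu`, `sample_on_cpu`, and `Session` are illustrative names invented for the sketch, not our actual internals:

```cpp
#include <vector>

// Illustrative types and stubs; all names except get_token are assumptions
// made for this sketch, not part of the actual API or internals.
using Token  = int;
using Logits = std::vector<float>;

struct Session {
    std::vector<Token> pending_prompt;
};

Logits decode_on_gpu(const std::vector<Token>&) { return Logits(32000, 0.f); } // GPU task (stub)
Token  sample_on_cpu(const Logits&) { return 0; }                              // CPU task (stub)

// A high-level op like get_token bundles both tasks into a single call:
Token get_token(Session& s) {
    Logits logits = decode_on_gpu(s.pending_prompt); // GPU is busy here
    Token t = sample_on_cpu(logits);                 // GPU is idle while this runs
    s.pending_prompt = {t};                          // will be decoded on the next call
    return t;
}
```

From the caller's point of view there is only the single call; the CPU/GPU split inside it is invisible.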
On an edge device, which only runs a single inference instance, this is fine and indeed inevitable, since the sampled token is needed for the next decode. On a server, however, we have several issues:
- The server potentially has multiple pending independent jobs.
- Sampling time is not negligible, and if it runs on the CPU, the GPU is idle during that time. From the point of view of an op, we have no way of knowing when the GPU is idle.
- The inference/decode itself may not utilize the GPU completely; its shape may leave part of the GPU idle.
If one is writing a server for a dedicated model, they would have finer-grained control over the resources, because they would have access to a genuinely low-level API that clearly separates CPU and GPU tasks. Fine-grained resource management would be possible.
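For comparison, here is a minimal sketch of what such a finer-grained server loop could look like, assuming a hypothetical split API where decode (GPU) and sample (CPU) are separate calls. None of these names exist in our API; they are assumptions for the sketch:

```cpp
#include <future>
#include <vector>

// Hypothetical split API: decode (GPU) and sample (CPU) are separate calls,
// so a server can overlap one job's sampling with another job's decode.
// All names and stubs below are assumptions for this sketch.
using Token  = int;
using Logits = std::vector<float>;

struct Job {
    std::vector<Token> pending;
};

Logits decode_on_gpu(const std::vector<Token>&) { return Logits(32000, 0.f); } // stub
Token  sample_on_cpu(const Logits&) { return 0; }                              // stub

// One scheduling step over a batch of independent jobs.
void step(std::vector<Job>& jobs) {
    std::future<Token> pending_sample; // CPU sampling of the previous job
    Job* prev = nullptr;
    for (Job& job : jobs) {
        // While the GPU decodes this job, the previous job's sampling
        // (launched at the end of the last iteration) runs on a CPU thread.
        Logits logits = decode_on_gpu(job.pending);
        if (prev) prev->pending = {pending_sample.get()};
        pending_sample = std::async(std::launch::async, sample_on_cpu, std::move(logits));
        prev = &job;
    }
    if (prev) prev->pending = {pending_sample.get()};
}
```

With our current `get_token`, this kind of interleaving is impossible, because the decode and the sample are fused inside one op.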
So... the fear is that with our design we will never be able to compete with a dedicated server in terms of utilization. We will always be at least several percent slower than a dedicated solution.
I don't know whether we can do something about it. The only approach I can see would be to require all plugins to use a shared compute library (like ggml) and then manage some state from there.
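To illustrate what "manage some state from there" could mean, here is a very rough sketch of a host-side scheduler that plugins would submit their CPU and GPU tasks to instead of running them inline. Everything here (`HostScheduler`, `ComputeTask`, `Backend`) is hypothetical; nothing like it exists today:

```cpp
#include <functional>
#include <future>
#include <queue>
#include <utility>

// Hypothetical host-side scheduler. If all plugins built their compute through
// a shared layer, they could submit tasks here instead of running them inline,
// and the host could keep the CPU and GPU busy across independent jobs.
enum class Backend { Cpu, Gpu };

struct ComputeTask {
    Backend backend;
    std::function<void()> run;
};

class HostScheduler {
public:
    void submit(ComputeTask t) {
        queues_[t.backend == Backend::Gpu].push(std::move(t));
    }

    // Trivial policy for the sketch: run a queued CPU task on a helper thread
    // while the next GPU task executes, instead of serializing them.
    void drain() {
        while (!queues_[0].empty() || !queues_[1].empty()) {
            std::future<void> cpu_work;
            if (!queues_[0].empty()) {
                ComputeTask t = std::move(queues_[0].front());
                queues_[0].pop();
                cpu_work = std::async(std::launch::async, std::move(t.run));
            }
            if (!queues_[1].empty()) {
                ComputeTask t = std::move(queues_[1].front());
                queues_[1].pop();
                t.run(); // GPU task; overlaps with the CPU task above
            }
            if (cpu_work.valid()) cpu_work.get();
        }
    }

private:
    std::queue<ComputeTask> queues_[2]; // [0] = CPU, [1] = GPU
};
```

The cost, of course, is forcing a specific compute library on plugin authors, which is exactly the trade-off above.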