I am trying to use llama-cpp-python, but I am getting 22 tokens per second instead of the 25 tokens per second that I usually get with plain llama-cpp. Furthermore, looking at GPU utilization, llama-cpp-python only reaches about 80% load versus 100% with pure llama-cpp. How can I get llama-cpp-python to perform the same? I am running both in Docker with the same base image, so I would expect identical speeds in both. Here is the Dockerfile for llama-cpp, which gives the good performance:
Performance is evaluated with the following command:

The Dockerfile for llama-cpp-python looks as follows:
The models.conf looks as follows:
Performance under llama-cpp-python is evaluated by checking the server logs while querying the server from Open WebUI connected through the OpenAI-compatible API. When using llama-cpp-python I do see something in the logs that I do not see with llama-cpp, and it might explain the difference in GPU load and performance: some tensors are running on the CPU for some reason. I do not know why, or how to fix it. Does anyone have any idea?
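For reference, the tokens-per-second number comes from a check roughly like the sketch below; the port (llama-cpp-python's default 8000) and the model alias are assumptions here, not the exact values from my setup:

```python
# Rough throughput check against an OpenAI-compatible chat endpoint.
# Assumptions: server at http://localhost:8000 and a model alias "my-model";
# substitute whatever models.conf actually defines.
import time
import requests

URL = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "my-model",  # hypothetical alias
    "messages": [{"role": "user", "content": "Write a short story about a robot."}],
    "max_tokens": 256,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=300)
resp.raise_for_status()
elapsed = time.time() - start

completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"-> {completion_tokens / elapsed:.1f} tokens/s")
```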
Managed to find the answer myself. For some reason the logits_all parameter defaults to true and tanks performance. Setting it to false brings the performance on par with pure llama-cpp. Not sure if that's a sensible default, but at least I managed to solve the problem. GPU load is also back to 100% again.
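In case it helps anyone else: if you load the model directly through the Python API rather than the bundled server, the same flag can be passed to the Llama constructor. A minimal sketch, with a placeholder model path and GPU layer count:

```python
# Minimal sketch: construct a Llama instance with logits_all disabled.
# The model path and n_gpu_layers value are placeholders for illustration.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/model.gguf",  # hypothetical path
    n_gpu_layers=-1,                  # offload all layers to the GPU
    logits_all=False,                 # don't compute logits for every token
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=32,
)
print(out["choices"][0]["message"]["content"])
```

When using the bundled OpenAI-compatible server with a config file, the same setting should go on the model entry (e.g. "logits_all": false), assuming the config follows the server's per-model settings fields.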