Plugins resolve the third-party lib and dependency issues. The others remain, but we simply don't envision having the resources to do anything but "use as-is" in the foreseeable future. We'll go on like this. Closing the discussion.
---
I'm going to be talking about llama.cpp here, but all of this is just as applicable to whisper.cpp (including libcommon).
Problems and drama
C interface
The library provides a C interface, whereas the C++ code is entirely in the .cpp file and not accessible in a conventional manner. This imposes many problems:

- We can't simply use smart pointers like unique_ptr<T>, but have to add custom deleters for library types, which invoke the C functions (which internally call delete). See the sketch after this list.
- Functionality which only lives in the .cpp file is not reachable through the C interface at all.
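To illustrate the first point, here is a minimal sketch of the custom-deleter workaround. llama_free and llama_free_model are real entry points in llama.cpp's C API (the latter renamed llama_model_free in newer versions); the deleter structs and aliases are hypothetical names of our own:

```cpp
// Minimal sketch: RAII wrappers over llama.cpp's opaque C handles.
// We cannot `delete` them ourselves; only the C functions can, since
// the actual C++ types are hidden inside the .cpp file.
#include <memory>
#include "llama.h"

struct model_deleter {
    void operator()(llama_model* m) const noexcept { llama_free_model(m); }
};
struct context_deleter {
    void operator()(llama_context* c) const noexcept { llama_free(c); }
};

// hypothetical aliases for use in our own code
using model_ptr   = std::unique_ptr<llama_model, model_deleter>;
using context_ptr = std::unique_ptr<llama_context, context_deleter>;
```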
The last point has a somewhat complicated example: backends. Ggml itself supports multiple backends, but llama.cpp supports two: CPU and non-CPU, the latter being whichever of the available backends happens to come first. This is not a huge issue for mobile platforms, as there are none (to my knowledge) which support more than one non-CPU backend, but it is on desktop. Say I build ggml and llama.cpp with both Vulkan and CUDA support. Running the code would only allow me access to CUDA (and CPU), because it happens to be the first of these backends checked by the API. To access Vulkan, I would have to build a binary without CUDA support. There are other problems like this: mmap control, file i/o control, global state, vocabulary access. All in the .cpp.
To actually support multiple backends, one would need to build llama.cpp multiple times and load the required instance of the library as a plugin (dlopen/dlsym). That's what ollama does.
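A minimal sketch of what that plugin loading could look like, assuming one llama.cpp build per backend. The shared-library names here are hypothetical; dlopen/dlsym are POSIX, and llama_backend_init is a real symbol in llama.cpp's C API:

```cpp
// Sketch of the ollama-style approach: one llama.cpp build per
// backend, selected and loaded at runtime. The .so names are made up.
#include <dlfcn.h>
#include <cstdio>

int main() {
    // pick the build that matches the backend the user asked for
    void* lib = dlopen("./libllama-vulkan.so", RTLD_NOW | RTLD_LOCAL);
    if (!lib) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    // resolve the C entry points by name instead of linking directly
    auto init = reinterpret_cast<void (*)()>(dlsym(lib, "llama_backend_init"));
    if (!init) {
        std::fprintf(stderr, "dlsym failed: %s\n", dlerror());
        return 1;
    }
    init();

    // ... resolve and call the rest of the C API the same way ...

    dlclose(lib);
    return 0;
}
```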
Third party libraries

llama.cpp uses third-party libraries (like stb_image and nlohmann::json). It does so by having copies of them in its codebase. If we statically link with llama.cpp (or its helper library common, which is always static) and happen to use different versions of any of these libraries as well, this will lead to linker errors or, more likely and terrifyingly, to ODR violations manifesting as crashes. And we would ideally link statically on mobile platforms.

Moreover, it does so without relying on CMake or a build system, but by directly adding the sources to its targets. An if(TARGET ...) approach like the one taken with ggml itself is not applicable here.

Possible solutions
- Ditch the llama.cpp repo and instead selectively pull code for our needs, relying only on ggml.
- Include the source files directly (#include "llama.cpp"). This means dealing with static funcs with the same name, guarding includes... stuff like that. Definitely won't be as big of a merge blocker as reimplementing.
- Only reimplement libcommon and use llama.cpp as a shared library.

I am leaning towards "Only reimplement libcommon and use llama.cpp as a shared library", again through creative copy-pasting where it makes sense (for now). In the future, once we're past the MVP, we should go and reimplement llama.cpp as well, and ditch the llama.cpp repo altogether.
This means that changes to libcommon would have to be manually tracked. This also goes for the main inference code, so it doesn't sound like a huge deal.
Including source files is the worst, and I'm against it.
Comments, questions, concerns, suggestions?