I've started working on this and will put up a draft CL with the implementation to confirm that I have the logic right. I may need some help deciding what the interface changes to TextInferenceEngine should look like to make this possible.
Also, decoding is currently entirely greedy. Should I keep using greedy decoding for the draft (speculative) model as well?
Code location: https://github.com/TabbyML/tabby/blob/main/crates/llama-cpp-bindings/src/engine.cc
Reference: https://github.com/ggerganov/llama.cpp/blob/master/examples/speculative/speculative.cpp#L47
Implement speculative decoding to speed up certain models.