Implement cot decoding with llama.cpp
Closed this issue · 5 comments
Hi there, I'm trying to do the same thing as well. Currently I'm working with https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#low-level-api
Hope this might be helpful to you as well.
Cheers.
Hey, that is great! It would be lovely if you could submit a PR; you can implement it as a plugin in optillm.
@codelion Hi, I've just run into some difficulties while trying to push this forward.
As in this repo, a lot of the methods are in fact custom sampling, and they require:
- when predicting, the model should return the most likely next tokens together with their probabilities, rather than a single token already sampled from the distribution;
- the ability to evaluate a pre-existing context, returning the probability of that context.
However, I checked both the low-level API in llama-cpp-python and vLLM, and it is hard to work around them to expose these two abilities as functions/plugins. It is still doable, though; the probabilities are there, as in the rough sketch below.
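For reference, here is roughly how both abilities look through llama-cpp-python's high-level API (an untested sketch; the model path is a placeholder, and the echo behaviour is my reading of the code, not something I have verified):

```python
# Untested sketch using llama-cpp-python's high-level API; the low-level API
# exposes the same information as raw logits, just with more ceremony.
from llama_cpp import Llama

# logits_all=True keeps logits for every position, which is needed to
# return log-probs over the prompt as well, not just the last token.
llm = Llama(model_path="model.gguf", logits_all=True)  # model path is a placeholder

prompt = "Q: I have 3 apples and eat one. How many are left?\nA:"

# Ability 1: the most likely next tokens with their probabilities.
out = llm.create_completion(
    prompt,
    max_tokens=1,
    logprobs=10,      # top-10 candidate tokens per step
    temperature=0.0,
)
print(out["choices"][0]["logprobs"]["top_logprobs"][0])

# Ability 2 (if I read the code right): echo=True also returns log-probs
# for the prompt tokens, which lets us score a pre-existing context.
scored = llm.create_completion(
    prompt + " 2",
    max_tokens=1,
    logprobs=1,
    echo=True,
    temperature=0.0,
)
print(scored["choices"][0]["logprobs"]["token_logprobs"])
```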
The other method, which I would call a strange and ugly workaround, is to use the current llama.cpp HTTP API, with no extra configuration needed. It also works with ollama, and I know lots of people use ollama because of its easy set-up compared to llama.cpp. The idea is to work with the HTTP API directly, especially the n_probs and grammar parameters: that way we can force the server to give us the distribution (but only over the most likely tokens, not many more) and to evaluate a pre-existing context. It is an ugly method, but it may serve more people and is less likely to suffer dependency breaks.
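Roughly what I have in mind, as a sketch against the llama.cpp server's /completion endpoint (field names are as I remember them from the server docs, so they may need adjusting for your version):

```python
# Untested sketch of the HTTP workaround against a local llama.cpp server.
import requests

resp = requests.post(
    "http://localhost:8080/completion",  # default llama.cpp server address
    json={
        "prompt": "Q: I have 3 apples and eat one. How many are left?\nA:",
        "n_predict": 1,
        "n_probs": 10,       # return the top candidate tokens with their probabilities
        "temperature": 0.0,
        # "grammar": "...",  # a grammar can pin the continuation to a fixed string,
        #                    # which effectively scores a pre-existing context
    },
    timeout=60,
)
data = resp.json()
print(data["content"])                   # the sampled continuation
print(data["completion_probabilities"])  # per-token candidate lists with probabilities
```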
I'm writing to ask for advice and to discuss which method we should take. Thank you.
The best way to implement it may be directly in C++, here: https://github.com/ggerganov/llama.cpp/blob/master/src/llama-sampling.cpp
We now have cot_decoding directly implemented in optillm via the local inference server - https://github.com/codelion/optillm?tab=readme-ov-file#local-inference-server
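For anyone finding this later, usage looks roughly like the following through the OpenAI-compatible client; the extra_body field and the model name are from memory, so please check the README linked above for the exact parameters.

```python
# Sketch of calling optillm's local inference server with CoT decoding.
# The extra_body parameter name is from memory; see the optillm README.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # optillm local inference server
    api_key="optillm",                    # placeholder key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",  # example local model
    messages=[{"role": "user", "content": "I have 3 apples and eat one. How many are left?"}],
    extra_body={"decoding": "cot_decoding"},   # select the CoT decoding approach
)
print(response.choices[0].message.content)
```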
Closing this ticket as this work needs to be implemented in llama.cpp; there is a discussion open in their repo here - ggerganov/llama.cpp#9620