Clarification: proxy or library for cot_decoding??
lee-b opened this issue · 2 comments
The documentation is unclear. It looks like the proxy code doesn't support cot_decoding at all, right? So it's only available via the library and the Python notebook demo right now?
Or are you saying that cot_decoding will work with the proxy if a suitable server is used that provides multiple choices in the chat response, like ollama (but not llama.cpp)?
I think it's the former, since the proxy code doesn't seem to include cot_decoding, but I'd like a clear statement on this.
Thanks!
cot_decoding cannot be implemented via a proxy with only API access to the model. It needs to look at the token logits before they are decoded. I have implemented it using the transformers library, so the model needs to be loaded with AutoModelForCausalLM.from_pretrained(model_name), as shown in the colab.
Multiple responses returned via an API have already been decoded using some existing strategy like beam search. To implement it in llama.cpp or vllm, we would need to integrate it into the upstream library that does the decoding.
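For anyone else landing on this issue, here is a rough sketch of why logits access is required (this is not the actual cot_decoding code; the model name, k, and generation length are placeholders): the method branches on the top-k candidates for the first decoded token and then continues each branch greedily, which is only possible if you can inspect the logits before decoding. The actual method also scores each branch by the confidence of the answer tokens, which likewise needs the raw logits.

```python
# Minimal sketch of logit-level branching with transformers (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Q: I have 3 apples and buy 2 more. How many apples do I have?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

k = 5  # number of alternative first tokens to branch on
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # next-token logits at the first decoding step
    topk = torch.topk(logits, k).indices     # top-k candidate first tokens

branches = []
for token_id in topk:
    # Append the candidate first token, then continue the branch greedily.
    branch_ids = torch.cat([inputs["input_ids"], token_id.view(1, 1)], dim=-1)
    with torch.no_grad():
        out = model.generate(branch_ids, max_new_tokens=40, do_sample=False)
    branches.append(tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:]))

# A full implementation would rank these branches by answer-token confidence;
# here we just print them.
for i, text in enumerate(branches):
    print(f"--- branch {i} ---\n{text}\n")
```

None of this information is exposed by a chat completions API, which only returns already-decoded text.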
Got it, thanks :)