twinnydotdev/twinny

Ideal setup of parallel chat and fim models


Hey, this isn't a real feature request or bug, but I don't know where else to ask. Twinny has recently gained the ability to use separate chat and FIM providers, which is great. Now, if I want to use Ollama for both, what would be the ideal setup? My current assumption is that Ollama can only keep one model in memory and loads/unloads on each switch, so using chat and FIM in parallel would mean a lot of loading/unloading. So should two instances be run in parallel? I have this set up now with two Ollama containers exposed on different ports, roughly as sketched below. Is that a good idea? Are there better ways to set it up? Maybe the docs could point out some sensible example deployments?
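For reference, this is a minimal sketch of what I mean (a docker-compose example; the service names, volume names, and host ports are just illustrative, not a recommended configuration):

```yaml
version: "3.8"
services:
  # First Ollama instance, used as the chat provider
  ollama-chat:
    image: ollama/ollama
    ports:
      - "11434:11434"   # default Ollama port on the host for chat
    volumes:
      - ollama-chat:/root/.ollama

  # Second Ollama instance, used as the FIM provider
  ollama-fim:
    image: ollama/ollama
    ports:
      - "11435:11434"   # different host port for FIM
    volumes:
      - ollama-fim:/root/.ollama

volumes:
  ollama-chat:
  ollama-fim:
```

Each container keeps its own model store, so the chat and FIM models have to be pulled into their respective instances separately.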

Hey @kirel thanks for the interest.

Good question! In all honesty, I am not sure of the optimal setup and have not read up on how model switching works in Ollama; maybe it is covered in their documentation?

I guess it's up to the individual how best to handle inference APIs. Personally, I use Ollama with code-llama:13b-code for FIM, and for chat a mixture of instruct models and LiteLLM proxying GPT-4, depending on whether the work is sensitive or not. If you have any interesting findings on this, please feel free to open a pull request so that others are aware. I welcome changes like this, as documentation is not something I spend a lot of time on, and I would appreciate the help.

Many thanks,

Closing the issue as it's a hardware/docs question and not really an issue with the software.