brianpetro/obsidian-smart-connections

Why do local models take FAR more energy/memory/time than in command line?

Closed this issue · 2 comments

Context for the issue:

  1. I am running Obsidian on an M2 MacBook Air with 16GB of memory.
  2. I have approximately 1,000 notes in Obsidian. My vault is around 285.8 MB.

ISSUE:

  1. I have no issues when using Ollama from the command line (llama3.1, mistral, or gemma2). Computer heat and memory usage remain nominal and unconcerning.
  2. When using "View Smart Connections" I see no issues. Performance with the native embedding model is fine.
  3. HOWEVER, when I use the "Smart Chat Conversation" (with any of the Ollama models listed above), computer memory and energy usage spikes considerably, my computer heats up to an uncomfortable level, and responses are FAR slower than in the command line.

Why is this? Is it simply that my vault is being loaded into the model and/or used as part of the "context"? Or is the LLM being "double" loaded (I know, probably a dumb question), i.e. loaded once by the command line and again by Obsidian, weighing the computer down? Further, I do not fully understand why answers in the Obsidian chat are so much slower than identical prompts in the command line. For example, 'Tell me about Isaac Newton' may take a second or two in the command line, but 15 to 30+ seconds when asking the identical question in Obsidian. Thank you all for your help with this! GREAT plugin :)

Hi @Luke2791

Given that the example would trigger a context lookup (it contains a self-referential pronoun), the round-trip response would be expected to take more than 2x as long as one without a context lookup.

This is because two requests are made to the LLM (see this explanation of HyDE to learn more about why).

Notably, the second request will contain a lot of context, so it is like running a request with a really long (pages-long) prompt. This is probably the cause of your computer's noticeable increase in resource usage.
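
For a rough picture of why a single chat turn costs roughly twice as much, here is a minimal sketch of a HyDE-style two-request flow against Ollama's `/api/chat` endpoint. The `lookupNotes` helper and the exact prompts are hypothetical stand-ins, not the plugin's actual code:

```ts
// Sketch of a HyDE-style chat turn: one LLM call to produce a hypothetical
// answer used for retrieval, then a second call with retrieved notes as context.
// lookupNotes() is a hypothetical placeholder for the embedding search.
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

async function ollamaChat(messages: ChatMessage[], model = 'llama3.1'): Promise<string> {
  const res = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, messages, stream: false }),
  });
  const data = await res.json();
  return data.message.content;
}

async function chatWithContext(userPrompt: string, lookupNotes: (q: string) => Promise<string[]>) {
  // Request 1: generate a hypothetical answer used only to find relevant notes.
  const hypothetical = await ollamaChat([
    { role: 'user', content: `Write a short hypothetical note that would answer: ${userPrompt}` },
  ]);

  // Embedding lookup against the vault (placeholder).
  const notes = await lookupNotes(hypothetical);

  // Request 2: the real answer, with retrieved notes prepended as context.
  // This prompt can be pages long, which is what drives memory/energy usage.
  return ollamaChat([
    { role: 'system', content: `Use the following notes as context:\n${notes.join('\n---\n')}` },
    { role: 'user', content: userPrompt },
  ]);
}
```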

Perception is also part of the issue, since Ollama cannot stream results to Obsidian due to a CORS issue. This means the timing should be compared at the last token, not the first: Smart Chat has to wait for all tokens to be generated before it receives the result.
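
To illustrate the perception point, here is a small sketch (not plugin code) contrasting a non-streaming request, where nothing is visible until the last token, with a streaming one, where the first chunk arrives almost immediately. It assumes a local Ollama server on the default port:

```ts
// Non-streaming request: the caller sees nothing until the final token is done,
// so perceived latency equals total generation time (time-to-last-token).
async function nonStreamingLatency(prompt: string, model = 'llama3.1'): Promise<number> {
  const start = performance.now();
  await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    body: JSON.stringify({ model, prompt, stream: false }),
  }).then((r) => r.json());
  return performance.now() - start; // whole answer generated before anything is shown
}

// Streaming request: the first chunk arrives quickly, which is why the
// command line *feels* fast even though total generation time is similar.
async function timeToFirstToken(prompt: string, model = 'llama3.1'): Promise<number> {
  const start = performance.now();
  const res = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    body: JSON.stringify({ model, prompt, stream: true }),
  });
  const reader = res.body!.getReader();
  await reader.read(); // first streamed chunk
  return performance.now() - start;
}
```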

I hope that helps clear some things up.
🌴

Great response and very helpful! Thank you, @brianpetro!
Also - I see somewhere else that someone mentioned adding an LM Studio option in a future release - that would be excellent! LM Studio also has an option to enable CORS, which might help resolve the issue Ollama is facing (i.e., running Llama3.1 via LM Studio with CORS turned on instead).