mostlygeek/llama-swap

[Bug] TTL starts counting from the beginning of a request instead of the end


This bug only recently came to my attention when I started using shorter TTL values with a chatty model (QwQ). It's very easy to reproduce with a very short TTL (e.g. 10 seconds) and a prompt that takes longer to complete than the TTL.

Steps to reproduce:

  1. Set the TTL to 10 seconds for a sufficiently large model.
  2. Ask the model to tell a story, and make sure the story takes longer than 10 seconds to generate.

Expected outcome:

  1. The model finishes generating the story, and the TTL then starts counting, giving you 10 seconds to ask a follow-up question.

Actual outcome:

  1. llama-swap prints a "!!! Unloading model Qwen2.5-Coder-32B-Instruct-Q4_K_S, TTL of 10 reached." message midway through generation. Thankfully it does not unload the model while it is still generating.
  2. However, it unloads the model immediately after the response finishes, so a follow-up question triggers a full reload of the model.

Suggested fix:

  1. Consider the model idle only when it has finished processing all requests, and start the TTL countdown at that point.
  2. Consider the model busy as soon as a new request comes in, and keep it busy until all in-flight requests finish; only then does the TTL start counting (see the sketch below).
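The suggested behaviour can be sketched as a small in-flight request counter plus a timer that is only armed when the counter drops back to zero. This is just an illustration of the idea, not llama-swap's actual code; the type and method names (`idleTracker`, `begin`, `end`) are hypothetical.

```go
package main

import (
	"sync"
	"time"
)

// idleTracker counts in-flight requests for a model and only starts the
// TTL countdown once the model has finished processing all of them.
type idleTracker struct {
	mu       sync.Mutex
	inFlight int
	ttl      time.Duration
	timer    *time.Timer
	unload   func() // called when the model has been idle for the full TTL
}

// begin marks the model busy and cancels any pending unload timer.
func (t *idleTracker) begin() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.inFlight++
	if t.timer != nil {
		t.timer.Stop()
		t.timer = nil
	}
}

// end marks one request as finished; when no requests remain, the TTL
// countdown starts from this moment rather than from the request start.
func (t *idleTracker) end() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.inFlight--
	if t.inFlight == 0 {
		t.timer = time.AfterFunc(t.ttl, t.unload)
	}
}
```

A proxy handler would call `begin()` when a request arrives and `end()` when the response (including streaming) has fully completed, so a follow-up question within the TTL window finds the model still loaded.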

Thanks for reporting this. I think I know exactly where the issue is. I'll take a look.

This should be fixed in v0.1.5.