[Bug] TTL starts counting from the beginning of a request instead of the end
Closed · 2 comments
Mushoz commented
This bug only recently came to my attention, when I started using shorter TTL values with a chatty model (QwQ). It is easy to reproduce with a very short TTL (e.g. 10 seconds) and a prompt that takes longer than the TTL to complete.
Steps to reproduce:
- Set the TTL of a sufficiently large model to 10 seconds (a minimal config sketch follows this list)
- Ask the model to tell a story, and make sure the story takes longer than 10 seconds to generate
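For context, a minimal llama-swap config sketch with a short TTL (the port, model path, and model command line here are placeholders, not the reporter's actual setup):

```yaml
models:
  "Qwen2.5-Coder-32B-Instruct-Q4_K_S":
    cmd: llama-server --port 9001 -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_S.gguf
    proxy: http://127.0.0.1:9001
    ttl: 10  # seconds; short enough that a long generation outlives it
```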
Expected outcome:
- The model finishes generating the story, and only then does the TTL start counting, giving you 10 seconds to ask a follow-up question
Actual outcome:
- llama-swap prints a "!!! Unloading model Qwen2.5-Coder-32B-Instruct-Q4_K_S, TTL of 10 reached." message midway through the generation. Thankfully it does not unload the model while it is still generating.
- But it does unload the model the instant generation finishes, so the model has to be reloaded if you ask a follow-up question.
Suggested fix:
- Consider the model idle only once it has finished processing all in-flight requests, and start counting towards the TTL at that point.
- Consider the model busy as soon as a new request comes in, cancelling any pending countdown; the TTL only starts counting again once the model is idle (see the sketch below).
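For illustration, a minimal Go sketch of this busy/idle accounting (the `idleTimer` type and its methods are hypothetical, not llama-swap's actual internals): the TTL countdown is armed only when the in-flight request count drops to zero, and disarmed whenever a new request arrives.

```go
package proxy

import (
	"sync"
	"time"
)

// idleTimer arms the TTL countdown only when the last in-flight
// request finishes, and disarms it as soon as a new request arrives.
type idleTimer struct {
	mu       sync.Mutex
	inFlight int
	ttl      time.Duration
	timer    *time.Timer
	unload   func() // called once the model has been idle for ttl
}

// requestStarted marks the model busy and cancels any pending unload.
func (t *idleTimer) requestStarted() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.inFlight++
	if t.timer != nil {
		t.timer.Stop()
		t.timer = nil
	}
}

// requestFinished marks one request complete; when the in-flight
// count drops to zero the model is idle and the countdown begins.
func (t *idleTimer) requestFinished() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.inFlight--
	if t.inFlight == 0 {
		t.timer = time.AfterFunc(t.ttl, t.unload)
	}
}
```

Under this scheme, the unload in the reported scenario would fire 10 seconds after the story finishes streaming, rather than 10 seconds after the request arrived.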
mostlygeek commented
Thanks for reporting this. I think I know exactly where the issue is. I'll take a look.
mostlygeek commented
This should be fixed in v0.1.5.