[BUG] Inline loading doesn't respect config.yml
Async0x42 opened this issue · 3 comments
OS
Linux
GPU Library
CUDA 12.x
Python version
3.12
Describe the bug
When a model is loaded inline, it doesn't respect the parameters set in config.yml. For example, when loading a model that defines a 128k context while config.yml lists max_seq_len: 32768, it loads with the full 128k context.
I read that inline loading is a shim for OAI compatibility, but I believe the behaviour of an inline load should at least match the behaviour of the user setting the model name in config.yml.
Reproduction steps
1. Set inline loading to true, do not set a default model name, and set max_seq_len: 32768 (see the sketch below)
2. Launch TabbyAPI
3. In a frontend client, load a model that has 128k context specified in its own configuration. TabbyAPI attempts to load it at the full 128k context rather than what's specified in config.yml
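For reference, a config.yml along these lines reproduces it. This is only a sketch: `inline_model_loading` is the key name I'm assuming from the sample config, so double-check it against your version:

```yml
model:
  # No default model; the client triggers an inline load instead
  model_name:

  # Assumed key for enabling inline loading (verify in config_sample.yml)
  inline_model_loading: true

  # Global cap that the inline load is expected to respect
  max_seq_len: 32768
```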
Expected behavior
The model is loaded inline using the values specified in TabbyAPI's config.yml
Logs
No response
Additional context
No response
Acknowledgements
- I have looked for similar issues before submitting this one.
- I have read the disclaimer, and this issue is related to a code bug. If I have a question, I will use the Discord server.
- I understand that the developers have lives and my issue will be answered when possible.
- I understand the developers of this program are human, and I will ask my questions politely.
This is already possible in two ways:
- Create a `tabby_config.yml` inside the model folder and add `max_seq_len: <whatever you want>`
- Inside `config.yml`, add `"max_seq_len"` to the `use_as_default` list. Please see the docs here
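For the second option, a minimal sketch of the config.yml change, assuming `use_as_default` sits in the model section as in the sample config:

```yml
model:
  # Keys listed here fall back to the config.yml value on API/inline loads
  use_as_default: ["max_seq_len"]

  # With the fallback above, inline loads are capped at this value
  max_seq_len: 32768
```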
Closing this issue due to already existing functionality.
That's great, thanks!
@bdashore3 Regarding the tabby_config, is it supposed to differ from the main Tabby config.yml?
For example, in tabby_config.yml you don't specify model: and then max_seq_len under it; you only specify max_seq_len at the top level. But if you want to include a draft model, the draft model config matches the main config.yml structure.
This is what works for both model and draft:
```yml
# Max sequence length (default: Empty).
# Fetched from the model's base sequence length in config.json by default.
max_seq_len: 32768

# Enable different cache modes for VRAM savings (default: FP16).
# Possible values: 'FP16', 'Q8', 'Q6', 'Q4'.
cache_mode: Q6

# Options for draft models (speculative decoding)
# This will use more VRAM!
draft_model:
  # An initial draft model to load.
  # Ensure the model is in the model directory.
  draft_model_name: Qwen2.5-0.5B-exl2_4.0bpw

  # Cache mode for draft models to save VRAM (default: FP16).
  # Possible values: 'FP16', 'Q8', 'Q6', 'Q4'.
  draft_cache_mode: Q6
```
I would have expected it to match the main Tabby config.yml formatting, though, such as the following (the model config params don't work this way, but the draft_model params do):
```yml
model:
  # Max sequence length (default: Empty).
  # Fetched from the model's base sequence length in config.json by default.
  max_seq_len: 32768

  # Enable different cache modes for VRAM savings (default: FP16).
  # Possible values: 'FP16', 'Q8', 'Q6', 'Q4'.
  cache_mode: Q6

# Options for draft models (speculative decoding)
# This will use more VRAM!
draft_model:
  # An initial draft model to load.
  # Ensure the model is in the model directory.
  draft_model_name: Qwen2.5-0.5B-exl2_4.0bpw

  # Cache mode for draft models to save VRAM (default: FP16).
  # Possible values: 'FP16', 'Q8', 'Q6', 'Q4'.
  draft_cache_mode: Q6
```