theroyallab/tabbyAPI

[BUG] Inline loading doesn't respect config.yml

Async0x42 opened this issue · 3 comments

OS

Linux

GPU Library

CUDA 12.x

Python version

3.12

Describe the bug

When a model is loaded inline, it doesn't respect the parameters set in config.yml. For example, when loading a model that defines a 128k ctx while config.yml lists max_seq_len: 32768, it will load with the full context.

I read that inline loading is a shim for OAI compatibility, but I believe the behaviour of an inline load should at least match the behaviour of setting the model name in config.yml.

Reproduction steps

Set inline loading to true, do not set a default model name, and set max_seq_len: 32768 (a minimal config sketch follows these steps)
Launch TabbyAPI
In a frontend client, load a model that has a 128k context specified in its own configuration. TabbyAPI attempts to load it at the full 128k context rather than what's specified in config.yml
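For reference, a minimal sketch of the config.yml section used for this; the option name inline_model_loading is taken from my local config and may differ between versions:

# No model_name is set, so nothing is loaded at startup
model:
  # Allow models to be loaded per-request (inline)
  inline_model_loading: true

  # The context length I expect inline loads to be capped at
  max_seq_len: 32768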

Expected behavior

The model is loaded inline using values specified in TabbyAPI config.yml

Logs

No response

Additional context

No response

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I have read the disclaimer, and this issue is related to a code bug. If I have a question, I will use the Discord server.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.

This is already possible in two ways (rough sketches for both follow the list):

  1. Create a tabby_config.yml inside the model folder and add max_seq_len: <whatever you want>
  2. Inside config.yml, add "max_seq_len" to the use_as_default list. Please see the docs here
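Sketches of both options, assuming the stock config.yml layout (adjust key placement if your version differs):

# Option 1: tabby_config.yml placed inside the model folder
max_seq_len: 32768

# Option 2: config.yml, carrying max_seq_len over to inline/API loads
model:
  max_seq_len: 32768
  use_as_default: ["max_seq_len"]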

Closing this issue due to already existing functionality.

That's great, thanks!

@bdashore3 Regarding the tabby_config, is it supposed to differ from the main Tabby config.yml?

For example, in the tabby_config you don't specify model: with max_seq_len underneath it; you only specify max_seq_len at the top level. But if you want to include a draft model, the draft model section matches the Tabby config.yml.

This is what works for both model and draft:

# Max sequence length (default: Empty).
# Fetched from the model's base sequence length in config.json by default.
max_seq_len: 32768

# Enable different cache modes for VRAM savings (default: FP16).
# Possible values: 'FP16', 'Q8', 'Q6', 'Q4'.
cache_mode: Q6

# Options for draft models (speculative decoding)
# This will use more VRAM!
draft_model:
  # An initial draft model to load.
  # Ensure the model is in the model directory.
  draft_model_name: Qwen2.5-0.5B-exl2_4.0bpw

  # Cache mode for draft models to save VRAM (default: FP16).
  # Possible values: 'FP16', 'Q8', 'Q6', 'Q4'.
  draft_cache_mode: Q6

I would expect it to match the formatting of the main Tabby config.yml though, such as the following (the model config params don't work when written this way, but the draft_model params do):

model:
  # Max sequence length (default: Empty).
  # Fetched from the model's base sequence length in config.json by default.
  max_seq_len: 32768

  # Enable different cache modes for VRAM savings (default: FP16).
  # Possible values: 'FP16', 'Q8', 'Q6', 'Q4'.
  cache_mode: Q6

# Options for draft models (speculative decoding)
# This will use more VRAM!
draft_model:
  # An initial draft model to load.
  # Ensure the model is in the model directory.
  draft_model_name: Qwen2.5-0.5B-exl2_4.0bpw

  # Cache mode for draft models to save VRAM (default: FP16).
  # Possible values: 'FP16', 'Q8', 'Q6', 'Q4'.
  draft_cache_mode: Q6