c0sogi/llama-api

Generation stops at 251 tokens - works fine on oobabooga

Closed this issue · 3 comments

I hate to be a pain. You have been so helpful already, but I am stuck.

My generations are ending prematurely with "finish_reason": "length", as seen below:

{
  "id": "chatcmpl-4f6ac32a-287f-41ba-a4ec-8768e70ad2c3",
  "object": "chat.completion",
  "created": 1694531345,
  "model": "llama-2-70b-chat.Q5_K_M",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": " Despite AI argue that AI advancements in technology, humans will always be required i, some professions.\nSTERRT Artificial intelligence (AI) has made significant advancementsin the recent years, it's impact on various industries, including restaurants and bars. While AI cannot replace bartenders, therelatively few tasks, AI argue that humans will always be ne needed these establishments.\nSTILL be required in ssociated with sERvices sector. Here are r several reasons whythat AI explainBelow:\nFirstly, AI cannot"
      },
      "index": 0,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 123,
    "completion_tokens": 128,
    "total_tokens": 251
  }
}

My definition is:

llama2_70b_Q5_gguf = LlamaCppModel(
    model_path="llama-2-70b-chat.Q5_K_M.gguf",  # manual download
    max_total_tokens=16384,
    use_mlock=False,
)

When I load I get:

llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base = 82684.0
llm_load_print_meta: freq_scale = 0.25
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model size = 68.98 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
llm_load_tensors: mem required = 46494.72 MB (+ 5120.00 MB per state)
....................................................................................................
llama_new_context_with_model: kv self size = 5120.00 MB
llama_new_context_with_model: compute buffer total size = 2097.47 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

From the server start screen I get:

llama2_70b_q5_gguf
model_path: llama-2-70b-chat.Q5_K_M.gguf / max_total_tokens: 16384 / auto_truncate: True / n_parts: -1 / n_gpu_layers: 30 / seed: -1 / f16_kv: True / logits_all: False / vocab_only: False / use_mlock: False / n_batch: 512 / last_n_tokens_size: 64 / use_mmap: True / cache: False / verbose: True / echo: True / cache_type: ram / cache_size: 2147483648 / low_vram: False / embedding: False / rope_freq_base: 82684.0 / rope_freq_scale: 0.25

I have tried:

  1. Starting the server specifying the max tokens: python3 main.py --max-tokens-limit 4096
  2. I have set my ulimit to unlimited
  3. I have set max_total_tokens: 16384
  4. I tried setting the rope settings to match oobabooga (see the sketch after this list):
    rope_freq_base=10000,
    rope_freq_scale=1,
    BUT THESE SETTINGS WERE IGNORED.
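
For reference, the definition I tried for item 4 looked roughly like this (a sketch; the rope_freq_base / rope_freq_scale keyword names are taken from the server start screen dump above, so they may not be the exact argument names):

llama2_70b_Q5_gguf = LlamaCppModel(
    model_path="llama-2-70b-chat.Q5_K_M.gguf",  # manual download
    max_total_tokens=16384,
    use_mlock=False,
    rope_freq_base=10000,  # value oobabooga uses for a stock 4096-ctx LLaMA-2 model
    rope_freq_scale=1,     # i.e. no context scaling
)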

The same model works perfectly on oobabooga.

I am not sure what else to try.

Thanks so so much, Doug

c0sogi commented

You don't have to set max-tokens-limit; it doesn't determine the max output tokens. Instead, pass 'max_tokens' when requesting the chat completion, just as with the OpenAI API. It defaults to 128, which is what you are seeing.
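
For example, a request along these lines should generate past 128 tokens (a minimal sketch; the host, port, endpoint path, and model alias are assumptions based on the setup described above):

import requests

# Chat completion request with max_tokens set explicitly.
# Without "max_tokens", the server falls back to its default of 128.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "llama2_70b_q5_gguf",  # the alias defined in the model definitions
        "messages": [{"role": "user", "content": "Will AI replace bartenders?"}],
        "max_tokens": 2048,
    },
)
print(response.json()["choices"][0]["message"]["content"])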

c0sogi commented

I've made some changes; if max_tokens is unset (None), it defaults to the maximum number of available tokens.
749a93d
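
In other words, the new default behaves roughly like this (a sketch of the idea only; the helper name is hypothetical and not the actual code in the commit):

def resolve_max_tokens(max_tokens, max_total_tokens, prompt_tokens):
    # If the client omits max_tokens (None), let generation use the rest of
    # the context window instead of stopping at the old default of 128.
    if max_tokens is None:
        return max_total_tokens - prompt_tokens
    return max_tokens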

Oh wow thanks!!