MLP-Lab/Bllossom

Unexpected Prompt Results on GPU Execution for Bllossom/llama-3-Korean-Bllossom-70B-gguf-Q4_K_M


When running the Bllossom/llama-3-Korean-Bllossom-70B-gguf-Q4_K_M model on CPU, prompts produce the expected output.
However, when the same model is run on GPU, the same prompts produce incorrect output (see the results below).
Is this a known issue?

  • Source code
from llama_cpp import Llama
from transformers import AutoTokenizer

model_id = 'Bllossom/llama-3-Korean-Bllossom-70B-gguf-Q4_K_M'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Llama(
    model_path='models/llama-3-Korean-Bllossom-70B-gguf-Q4_K_M.gguf',
    n_ctx=512,
    n_gpu_layers=-1     # Number of model layers to offload to GPU (-1 = offload all layers)
)

PROMPT = \
'''당신은 유용한 AI 어시스턴트입니다. 사용자의 질의에 대해 친절하고 정확하게 답변해야 합니다.
You are a helpful AI assistant, you'll need to answer users' queries in a friendly and accurate manner.'''

instruction = '2x + 3 = 7이라면 x는?'

messages = [
    {"role": "system", "content": f"{PROMPT}"},
    {"role": "user", "content": f"{instruction}"}
    ]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

generation_kwargs = {
    "max_tokens":512,
    "stop":["<|eot_id|>"],
    "echo":True, # Echo the prompt in the output
    "top_p":0.9,
    "temperature":0.6,
}

response_msg = model(prompt, **generation_kwargs)
print(response_msg['choices'][0]['text'])
  • Prompt Results
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

당신은 유용한 AI 어시스턴트입니다. 사용자의 질의에 대해 친절하고 정확하게 답변해야 합니다.<|eot_id|><|start_header_id|>user<|end_header_id|>

2x + 3 = 7이라면 x는?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

어떤 것이 있습니다.

1)
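
  • CPU vs GPU comparison (sketch)
A minimal sketch for isolating the problem, assuming the same model file and the llama-cpp-python calls used above: run the identical chat-templated prompt once CPU-only (n_gpu_layers=0) and once fully offloaded (n_gpu_layers=-1) with greedy decoding, then compare the outputs. The run_once helper below is hypothetical and not part of the original report.

from llama_cpp import Llama

MODEL_PATH = 'models/llama-3-Korean-Bllossom-70B-gguf-Q4_K_M.gguf'

def run_once(n_gpu_layers, prompt):
    # n_gpu_layers: -1 offloads every layer to the GPU, 0 keeps the model on the CPU
    model = Llama(
        model_path=MODEL_PATH,
        n_ctx=512,
        n_gpu_layers=n_gpu_layers,
        seed=42,          # fixed seed so the two runs are directly comparable
        verbose=False
    )
    result = model(
        prompt,
        max_tokens=512,
        stop=["<|eot_id|>"],
        temperature=0.0   # greedy decoding removes sampling noise from the comparison
    )
    return result['choices'][0]['text']

# prompt is the chat-templated string built in the source code above
cpu_text = run_once(0, prompt)
gpu_text = run_once(-1, prompt)
print("outputs match:", cpu_text == gpu_text)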