Characters are dropped during inference with a ggmlv3 q6_K model

Question

wennycooper opened this issue a year ago · 3 comments

Hello,
I quantized the model to the q6_K format with ggml, then ran inference with the following code:

```python
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# load Llama-2 model
llm = LlamaCpp(
    model_path="/workspace/test/TaiwanLLama_v1.0/Taiwan-LLaMa-13b-1.0.ggmlv3.q6_K.bin",
    n_gpu_layers=16,
    n_batch=8,
    n_ctx=2048,
    temperature=0.1,
    max_tokens=512,
    callback_manager=callback_manager,
)

# response = run_simple_qa(llm, query)
prompt_template = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"""
prompt = prompt_template.format("什麼是深度學習?")
response = llm(prompt)
```
The output drops characters, as shown below:

深度學是機器學的一子集，基人工神經結。使得計算機能通別模式大量中學，而不需要明編程。深度學算法用分、進行和別模式

wennycooper commented a year ago

Thanks!

Answer 1 · 2023-09-22T01:12:45.000Z

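Judging from the comment `Callbacks support token-wise streaming` in your code, the problem is quite likely in StreamingStdOutCallbackHandler. You could search for existing issues about StreamingStdOutCallbackHandler and UTF-8 CJK characters, or skip streaming altogether and simply print the response once inference has finished.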

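Streaming output like this runs into this kind of problem easily when a BPE tokenizer meets CJK text: a single CJK character can be split across several tokens, so a piece printed on its own may not be a complete UTF-8 character.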
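To illustrate the suspected mechanism, here is a minimal sketch using only the Python standard library. It is not the actual code path of LangChain or llama-cpp-python. The chunk boundaries are hypothetical and chosen so that one character is split across chunks, and the helper names `decode_per_chunk` and `decode_incrementally` are made up for this example.

```python
import codecs


def decode_per_chunk(chunks):
    """Decode every chunk on its own, discarding bytes that do not
    form a complete UTF-8 character within that chunk."""
    return "".join(c.decode("utf-8", errors="ignore") for c in chunks)


def decode_incrementally(chunks):
    """Decode with an incremental decoder that keeps incomplete
    byte sequences buffered until the rest of the character arrives."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = [decoder.decode(c) for c in chunks]
    pieces.append(decoder.decode(b"", final=True))  # flush the decoder
    return "".join(pieces)


data = "深度學習".encode("utf-8")  # each of these characters is 3 bytes

# Hypothetical chunk boundaries: the last character is split 1 + 2 bytes.
chunks = [data[:9], data[9:10], data[10:]]

lossy = decode_per_chunk(chunks)
complete = decode_incrementally(chunks)
print(lossy)
print(complete)
```

Decoding each chunk in isolation loses the split character, which resembles the missing characters in the output above. Buffering the incomplete bytes until the character is complete keeps it.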

Answer 2 · 2023-09-22T01:54:10.000Z

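@wennycooper I just tested this myself. The model you are using is in the ggml v3 format, which the ggml maintainers have marked as deprecated,
so I used the gguf format instead; see this repo by 唐鳳 (Audrey Tang).
I tested with q4_0 here and the output looks fine, so the problem may be related to package versions.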

```python
# CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama_cpp_python==0.2.6
# pip install langchain==0.0.298

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# load Llama-2 model
llm = LlamaCpp(
    model_path="/path/to/Taiwan-LLaMa-13b-1.0.Q4_0.gguf",
    n_gpu_layers=100,
    n_batch=8,
    n_ctx=512,
    temperature=0.1,
    max_tokens=512,
    callback_manager=callback_manager,
)

# response = run_simple_qa(llm, query)
prompt_template = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"""
prompt = prompt_template.format("什麼是深度學習?")
response = llm(prompt)
print(response)
```

Edit: I just tested gguf q6_k as well, and the output is also fine.
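If you are not sure whether a given model file is already in the gguf format, a quick check is to look at its first four bytes, which are `GGUF` for gguf files. The sketch below is only an illustration: the helper name `is_gguf` is hypothetical, the two files are throwaway stand-ins created in a temporary directory, and the check says nothing about whether the file will load or which older ggml variant a non-gguf file might be.

```python
import os
import tempfile

GGUF_MAGIC = b"GGUF"  # a gguf file starts with these four bytes


def is_gguf(path):
    """Return True if the file at `path` starts with the gguf magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC


# Demonstration with stand-in files (not real models).
with tempfile.TemporaryDirectory() as tmp:
    gguf_like = os.path.join(tmp, "model.gguf")
    other = os.path.join(tmp, "model.bin")
    with open(gguf_like, "wb") as f:
        f.write(GGUF_MAGIC + b"\x00" * 4)
    with open(other, "wb") as f:
        f.write(b"not a gguf file")
    results = (is_gguf(gguf_like), is_gguf(other))

print(results)
```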