xusenlinzy/api-for-open-llm

Hello, does the llama-cpp startup mode support chatglm3-6b?

lucheng07082221 opened this issue · 1 comment

起始日期 | Start Date

1/12

实现PR | Implementation PR

PORT=80

model related

MODEL_NAME=chatglm3
MODEL_PATH=/workspace/chatglm.cpp/chatglm-ggml.bin
PROMPT_NAME=
EMBEDDING_NAME=

api related

API_PREFIX=/v1

vllm related

ENGINE=llama.cpp
TRUST_REMOTE_CODE=true
TOKENIZE_MODE=slow
TENSOR_PARALLEL_SIZE=1
DTYPE=half
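
With ENGINE=llama.cpp, the server hands MODEL_PATH straight to llama-cpp-python (the create_llama_cpp_engine frame in the traceback below). A minimal reproduction of just that load step, assuming llama-cpp-python is installed, would be roughly:

from llama_cpp import Llama

# this mirrors what create_llama_cpp_engine() does with the .env values above;
# with the chatglm.cpp-converted .bin it fails in the same way as the server does
llm = Llama(model_path="/workspace/chatglm.cpp/chatglm-ggml.bin")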

相关Issues | Reference Issues

The above is my runtime configuration. chatglm-ggml.bin is a model I converted with the chatglm.cpp project, but startup fails with the following error:
python3 server.py
2024-01-12 09:10:22.356 | DEBUG | api.config:<module>:264 - SETTINGS: {
"host": "0.0.0.0",
"port": 80,
"api_prefix": "/v1",
"engine": "llama.cpp",
"model_name": "chatglm3",
"model_path": "/workspace/chatglm.cpp/chatglm-ggml.bin",
"adapter_model_path": null,
"resize_embeddings": false,
"dtype": "half",
"device": "cuda",
"device_map": null,
"gpus": null,
"num_gpus": 1,
"only_embedding": false,
"embedding_name": null,
"embedding_size": -1,
"embedding_device": "cuda",
"quantize": 16,
"load_in_8bit": false,
"load_in_4bit": false,
"using_ptuning_v2": false,
"pre_seq_len": 128,
"context_length": -1,
"chat_template": null,
"patch_type": null,
"alpha": "auto",
"trust_remote_code": true,
"tokenize_mode": "slow",
"tensor_parallel_size": 1,
"gpu_memory_utilization": 0.9,
"max_num_batched_tokens": -1,
"max_num_seqs": 256,
"quantization_method": null,
"use_streamer_v2": false,
"api_keys": null,
"activate_inference": true,
"interrupt_requests": true,
"n_gpu_layers": 0,
"main_gpu": 0,
"tensor_split": null,
"n_batch": 512,
"n_threads": 64,
"n_threads_batch": 64,
"rope_scaling_type": -1,
"rope_freq_base": 0.0,
"rope_freq_scale": 0.0,
"tgi_endpoint": null,
"tei_endpoint": null,
"max_concurrent_requests": 256,
"max_client_batch_size": 32
}
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 8 CUDA devices:
Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
gguf_init_from_file: invalid magic characters 'ggml'
error loading model: llama_model_loader: failed to load model from /workspace/chatglm.cpp/chatglm-ggml.bin

llama_load_model_from_file: failed to load model
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
Traceback (most recent call last):
File "/workspace/api-for-open-llm-v3/server.py", line 2, in
from api.models import app, EMBEDDED_MODEL, GENERATE_ENGINE
File "/workspace/api-for-open-llm-v3/api/models.py", line 165, in
GENERATE_ENGINE = create_llama_cpp_engine()
File "/workspace/api-for-open-llm-v3/api/models.py", line 127, in create_llama_cpp_engine
engine = Llama(
File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 962, in init
self._n_vocab = self.n_vocab()
File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 2276, in n_vocab
return self._model.n_vocab()
File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 251, in n_vocab
assert self.model is not None
AssertionError
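The key line is "gguf_init_from_file: invalid magic characters 'ggml'": llama-cpp-python only loads GGUF files (whose first four bytes are b'GGUF'), while the chatglm.cpp converter writes its own GGML-based format whose header starts with b'ggml', which is exactly what the loader is rejecting. A quick sketch to check which format a given file actually is:

# read the first four bytes of the converted model file
with open("/workspace/chatglm.cpp/chatglm-ggml.bin", "rb") as f:
    magic = f.read(4)

# GGUF models start with b'GGUF'; here this prints b'ggml', matching the error above
print(magic)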

摘要 | Summary

Please help take a look.

基本示例 | Basic Example

Where is the problem occurring?

缺陷 | Drawbacks

Asking for help.

未解决问题 | Unresolved questions

No response

The list of models supported by https://github.com/ggerganov/llama.cpp does not include chatglm.
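
The converted chatglm-ggml.bin is in chatglm.cpp's own format rather than GGUF, so it cannot be served through the llama.cpp engine. For comparison, the same file does load with chatglm.cpp's own Python bindings (the chatglm-cpp package); a minimal sketch, with the caveat that the exact chat() signature differs between chatglm-cpp versions (older releases take a plain list of prompt strings):

import chatglm_cpp

# load the chatglm.cpp-converted model with its own bindings instead of llama-cpp-python
pipeline = chatglm_cpp.Pipeline("/workspace/chatglm.cpp/chatglm-ggml.bin")

# recent chatglm-cpp releases take ChatMessage objects and return one as the reply
reply = pipeline.chat([chatglm_cpp.ChatMessage(role="user", content="hello")])
print(reply.content)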