xusenlinzy/api-for-open-llm

TypeError: '>=' not supported between instances of 'RuntimeError' and 'int'

wangjiainchinatelecom opened this issue · 7 comments

提交前必须检查以下项目 | The following items must be checked before submission

  • 请确保使用的是仓库最新代码(git pull),一些问题已被解决和修复。 | Make sure you are using the latest code from the repository (git pull), some issues have already been addressed and fixed.
  • 我已阅读项目文档FAQ章节并且已在Issue中对问题进行了搜索,没有找到相似问题和解决方案 | I have searched the existing issues / discussions

问题类型 | Type of problem

模型推理和部署 | Model inference and deployment

操作系统 | Operating system

Linux

详细描述问题 | Detailed description of the problem

""" https://github.com/facebookresearch/codellama/blob/main/example_completion.py """

from langchain.llms import OpenAI

llm = OpenAI(
model_name="code-llama",
openai_api_base="http://localhost:8000/v1",
openai_api_key="xxx",
)

def test():
# For these prompts, the expected answer is the natural continuation of the prompt
prompts = [
"""
import socket

def ping_exponential_backoff(host: str):""",
"""
import argparse

def main(string: str):
print(string)
print(string[::-1])

if name == "main":"""
]

for prompt in prompts:
    result = llm(prompt)
    print(prompt)
    print(f"> {result}")
    print("\n==================================\n")

if name == "main":
test()
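
To rule out the langchain wrapper, the same request can be sent to the server directly. The sketch below is not part of the original report: it reuses only the base URL, model name, and dummy API key shown above, POSTs to the /v1/completions route that appears in the server log, and picks an arbitrary max_tokens value.

# Minimal sketch: call /v1/completions with `requests` to confirm the 500 error
# comes from the server itself rather than from langchain.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    headers={"Authorization": "Bearer xxx"},
    json={
        "model": "code-llama",
        "prompt": "import socket\n\ndef ping_exponential_backoff(host: str):",
        "max_tokens": 128,  # illustrative value
    },
    timeout=120,
)
print(response.status_code)
print(response.json())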

Dependencies

peft 0.6.2
sentence-transformers 2.2.2
torch 2.1.1
torchvision 0.16.1
transformers 4.33.2
transformers-stream-generator 0.0.4

运行日志或截图 | Runtime logs or screenshots

python server.py
2023-11-21 09:57:39.449 | DEBUG | api.config::130 - Config: {'HOST': '0.0.0.0', 'PORT': 8000, 'MODEL_NAME': 'code-llama', 'MODEL_PATH': '/huggingface/models/CodeLlama-13b-Python-hf', 'ADAPTER_MODEL_PATH': None, 'RESIZE_EMBEDDINGS': False, 'DEVICE': 'cuda', 'DEVICE_MAP': None, 'GPUS': '2,3', 'NUM_GPUs': 2, 'ONLY_EMBEDDING': False, 'EMBEDDING_NAME': '/huggingface/m3e-base', 'EMBEDDING_SIZE': None, 'EMBEDDING_DEVICE': 'cuda', 'QUANTIZE': 16, 'LOAD_IN_8BIT': False, 'LOAD_IN_4BIT': False, 'USING_PTUNING_V2': False, 'CONTEXT_LEN': None, 'STREAM_INTERVERL': 2, 'PROMPT_NAME': None, 'PATCH_TYPE': None, 'ALPHA': 'auto', 'API_PREFIX': '/v1', 'USE_VLLM': False, 'TRUST_REMOTE_CODE': False, 'TOKENIZE_MODE': 'auto', 'TENSOR_PARALLEL_SIZE': 1, 'DTYPE': 'half', 'GPU_MEMORY_UTILIZATION': 0.9, 'MAX_NUM_BATCHED_TOKENS': None, 'MAX_NUM_SEQS': 256, 'QUANTIZATION_METHOD': None, 'USE_STREAMER_V2': False, 'API_KEYS': None, 'ACTIVATE_INFERENCE': True}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:07<00:00, 2.34s/it]
2023-11-21 09:57:51.316 | INFO | api.generation.core:fix_tokenizer:83 - Add pad token:
INFO: Started server process [15967]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 127.0.0.1:55004 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/huggingface/code/api-for-open-llm/api/generation/core.py", line 101, in generate_stream_gate_v1
for output in self.generate_stream_func(
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "/huggingface/code/api-for-open-llm/api/generation/stream.py", line 95, in generate_stream
out = model(torch.as_tensor([input_ids], device=device), use_cache=True)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
output = module._old_forward(*args, **kwargs)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 820, in forward
outputs = self.model(
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 708, in forward
layer_outputs = decoder_layer(
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
output = module._old_forward(*args, **kwargs)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 424, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/accelerate/hooks.py", line 164, in new_forward
output = module._old_forward(*args, **kwargs)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 346, in forward
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in call
return await self.app(scope, receive, send)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/fastapi/applications.py", line 276, in call
await super().call(scope, receive, send)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/starlette/applications.py", line 122, in call
await self.middleware_stack(scope, receive, send)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in call
raise exc
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in call
await self.app(scope, receive, _send)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 84, in call
await self.app(scope, receive, send)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in call
raise exc
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in call
await self.app(scope, receive, sender)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in call
raise e
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in call
await self.app(scope, receive, send)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/starlette/routing.py", line 718, in call
await route.handle(scope, receive, send)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/fastapi/routing.py", line 237, in app
raw_response = await run_endpoint_function(
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/site-packages/fastapi/routing.py", line 163, in run_endpoint_function
return await dependant.call(**values)
File "/huggingface/code/api-for-open-llm/api/routes/completion.py", line 59, in create_completion
content = GENERATE_MDDEL.generate_gate(gen_params)
File "/huggingface/code/api-for-open-llm/api/generation/core.py", line 164, in generate_gate
for x in self.generate_stream_gate(params):
File "/huggingface/code/api-for-open-llm/api/generation/core.py", line 94, in generate_stream_gate
yield from self.generate_stream_gate_v1(params)
File "/huggingface/code/api-for-open-llm/api/generation/core.py", line 129, in generate_stream_gate_v1
traceback.print_exc(e)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/traceback.py", line 179, in print_exc
print_exception(*sys.exc_info(), limit=limit, file=file, chain=chain)
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/traceback.py", line 119, in print_exception
te = TracebackException(type(value), value, tb, limit=limit, compact=True)
File "/home/wangjia/.local/lib/python3.10/site-packages/exceptiongroup/_formatting.py", line 96, in init
self.stack = traceback.StackSummary.extract(
File "/huggingface/miniconda3/envs/api-for-open-llm/lib/python3.10/traceback.py", line 357, in extract
if limit >= 0:
TypeError: '>=' not supported between instances of 'RuntimeError' and 'int'
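
The '>=' TypeError in the last frames is a secondary failure in the server's own error handling, not the root cause. According to the traceback, api/generation/core.py line 129 calls traceback.print_exc(e), but print_exc takes limit as its first positional argument, so the caught RuntimeError ends up being compared against 0 inside StackSummary.extract. The real failure here is the CUDA cuBLAS error a few frames above; the TypeError only masks it. A minimal standalone sketch of the logging bug (not the project's code, just stdlib behaviour):

# traceback.print_exc() accepts (limit, file, chain), so passing the exception
# object binds it to `limit` and triggers the comparison error seen above.
import traceback

try:
    raise RuntimeError("CUDA error: CUBLAS_STATUS_INTERNAL_ERROR ...")
except RuntimeError as exc:
    # traceback.print_exc(exc)     # -> TypeError: '>=' not supported between 'RuntimeError' and 'int'
    traceback.print_exc()          # prints the original RuntimeError's traceback
    traceback.print_exception(exc)  # Python 3.10+: passing the exception explicitly also works

Fixing that call will not make generation succeed, but it will stop the real CUDA error from being hidden behind the TypeError.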

It looks like an environment issue; I have no problem when running it with Docker.

I'm not using Docker, just miniconda, but the same environment runs ChatGLM3-6b without any problem.

I'm running into this problem too. I'm using Qwen; my setup uses api-for-open-llm to emulate the OpenAI API, and FastGPT calls the model through it.

apiserver_1 | Warning: please make sure that you are using the latest codes and checkpoints, especially if you used Qwen-7B before 09.25.2023.请使用最新模型和代码,尤其如果你在9月25日前已经开始使用Qwen-7B,千万注意不要使用错误代码和模型。
apiserver_1 | Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
apiserver_1 | Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Loading checkpoint shards: 100%|██████████| 3/3 [00:33<00:00, 11.22s/it]
apiserver_1 | 2023-11-23 08:35:15.491 | INFO | api.generation.core:__init__:58 - Using Qwen Model for Chat!
apiserver_1 | 2023-11-23 08:35:15.491 | INFO | api.generation.core:fix_tokenizer:76 - Add eos token: <|endoftext|>
apiserver_1 | 2023-11-23 08:35:15.491 | INFO | api.generation.core:fix_tokenizer:83 - Add pad token: <|endoftext|>
apiserver_1 | INFO: Started server process [1]
apiserver_1 | INFO: Waiting for application startup.
apiserver_1 | INFO: Application startup complete.
apiserver_1 | INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
apiserver_1 | INFO: 172.16.5.1:36186 - "POST /v1/embeddings HTTP/1.1" 200 OK
apiserver_1 | 2023-11-23 08:35:25.796 | DEBUG | api.routes.chat:create_chat_completion:67 - ==== request ====
apiserver_1 | {'model': 'gpt-3.5-turbo-16k', 'frequency_penalty': 0.0, 'function_call': None, 'functions': None, 'logit_bias': None, 'max_tokens': 15993, 'n': 1, 'presence_penalty': 0.0, 'response_format': None, 'seed': 1, 'stop': ['<|endoftext|>', '<|im_end|>'], 'temperature': 0.01, 'tool_choice': None, 'tools': None, 'top_p': 1.0, 'user': None, 'stream': True, 'prompt': [{'content': '你好', 'role': 'user'}], 'echo': False, 'stop_token_ids': [151643, 151644, 151645]}
apiserver_1 | INFO: 172.16.5.1:36186 - "POST /v1/chat/completions HTTP/1.1" 200 OK
apiserver_1 | ERROR: Exception in ASGI application
apiserver_1 | Traceback (most recent call last):
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
apiserver_1 | result = await app( # type: ignore[func-returns-value]
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in call
apiserver_1 | return await self.app(scope, receive, send)
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 276, in call
apiserver_1 | await super().call(scope, receive, send)
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in call
apiserver_1 | await self.middleware_stack(scope, receive, send)
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in call
apiserver_1 | raise exc
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in call
apiserver_1 | await self.app(scope, receive, _send)
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 84, in call
apiserver_1 | await self.app(scope, receive, send)
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 79, in call
apiserver_1 | raise exc
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 68, in call
apiserver_1 | await self.app(scope, receive, sender)
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 21, in call
apiserver_1 | raise e
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in call
apiserver_1 | await self.app(scope, receive, send)
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 718, in call
apiserver_1 | await route.handle(scope, receive, send)
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 276, in handle
apiserver_1 | await self.app(scope, receive, send)
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 69, in app
apiserver_1 | await response(scope, receive, send)
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in call
apiserver_1 | async with anyio.create_task_group() as task_group:
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in aexit
apiserver_1 | raise exceptions[0]
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
apiserver_1 | await func()
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
apiserver_1 | async for chunk in self.body_iterator:
apiserver_1 | File "/workspace/api/routes/chat.py", line 131, in chat_completion_stream_generator
apiserver_1 | for content in GENERATE_MDDEL.generate_stream_gate(gen_params):
apiserver_1 | File "/workspace/api/generation/core.py", line 94, in generate_stream_gate
apiserver_1 | yield from self.generate_stream_gate_v1(params)
apiserver_1 | File "/workspace/api/generation/core.py", line 129, in generate_stream_gate_v1
apiserver_1 | traceback.print_exc(e)
apiserver_1 | File "/usr/lib/python3.10/traceback.py", line 179, in print_exc
apiserver_1 | print_exception(*sys.exc_info(), limit=limit, file=file, chain=chain)
apiserver_1 | File "/usr/lib/python3.10/traceback.py", line 119, in print_exception
apiserver_1 | te = TracebackException(type(value), value, tb, limit=limit, compact=True)
apiserver_1 | File "/usr/local/lib/python3.10/dist-packages/exceptiongroup/_formatting.py", line 96, in init
apiserver_1 | self.stack = traceback.StackSummary.extract(
apiserver_1 | File "/usr/lib/python3.10/traceback.py", line 357, in extract
apiserver_1 | if limit >= 0:
apiserver_1 | TypeError: '>=' not supported between instances of 'RuntimeError' and 'int'
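
Note that in this Qwen run the underlying exception never reaches the log at all, because the handler crashes inside traceback.print_exc(e) before printing it. Below is a hedged sketch of what a corrected except block around api/generation/core.py line 129 might look like; the handler shape is inferred from the tracebacks in this thread, and the argument list and yielded error payload are assumptions for illustration, not the project's actual code.

import traceback

def generate_stream_gate_v1(self, params):
    # Hypothetical reconstruction: only the try/for/except structure and the
    # self.generate_stream_func call are visible in the tracebacks above.
    try:
        for output in self.generate_stream_func(self.model, self.tokenizer, params):
            yield output
    except Exception:
        traceback.print_exc()  # no argument: logs the real underlying error
        yield {"text": "internal server error", "error_code": 1}  # assumed error shape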

I have the same problem. Has anyone solved it?

Not solved; I switched to vLLM.

I found that this problem shows up once the chat history exceeds 3 messages, but I don't know why.

Has this been solved?