ymcui/Chinese-LLaMA-Alpaca-2

Excessive GPU memory usage at runtime, and no JSON response body returned

xiaoToby opened this issue · 17 comments

Checklist before submitting

  • Make sure you are using the latest code from the repository (git pull); some problems have already been resolved and fixed.
  • I have read the FAQ section of the project documentation and searched the existing issues, and found no similar problem or solution.
  • For third-party plugin problems (e.g., llama.cpp, LangChain, text-generation-webui), we also recommend searching for solutions in the corresponding project.

Issue type

Model quality issue

Base model

Chinese-Alpaca-2 (7B/13B)

Operating system

Linux

Describe the problem in detail

After deploying the chinese-alpaca-2-7b model locally, I tested scripts/openai_server_demo/openai_api_server.py with the following command:
curl http://localhost:19327/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user","content": "给我讲一些有关杭州的故事吧"}
  ],
  "repetition_penalty": 1.0
}'

  1. First I ran it on the GPU and found that GPU memory usage was too high, and it errored out.
  2. With --only_gpu, I did not get the expected answer.

Questions:
1. Is there any way to optimize the high GPU memory usage?
2. How can I get the expected chat-style (Q&A) response?

Dependencies (required for code-related issues)

# Paste your dependency information here (inside this code block)

Runtime logs or screenshots

[screenshot of the error]

  1. Try loading the model for inference in 4-bit/8-bit (see the sketch below); try loading with flash-attn2 or sdpa; set gpus to all of the cards on the machine.
  2. It is unclear whether you are using the model and template correctly; please post the exact command you ran.
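For reference, a minimal sketch of point 1, assuming transformers with bitsandbytes installed; the model path models/ is taken from the thread, and the exact flags and versions are my assumption, not the repo's server script:

# Sketch only: load the model quantized to 4-bit so a 7B model needs
# roughly 4 GB of VRAM for weights instead of ~14 GB at fp16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # store 4-bit, compute in fp16
)
tokenizer = AutoTokenizer.from_pretrained("models/")
model = AutoModelForCausalLM.from_pretrained(
    "models/",
    quantization_config=bnb_config,
    device_map="auto",            # spread layers over all visible GPUs
    attn_implementation="sdpa",   # or "flash_attention_2" if flash-attn2 is installed
)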

[screenshot]
@iMountTai

I'd like to ask: what are the GPU memory requirements for running these models? I'm currently using a single 12 GB GPU, which doesn't seem to be enough.

The 7B model's weights alone are about 14 GB, so a 12 GB GPU is definitely not enough, and CPU inference is too slow; we recommend trying llama.cpp instead.
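For reference, the arithmetic behind that figure: 7 × 10⁹ parameters × 2 bytes per fp16 weight ≈ 14 GB for the weights alone, before the KV cache and activations are counted. 8-bit quantization halves that to roughly 7 GB and 4-bit to roughly 3.5 GB, which is how the 4-bit/8-bit suggestion above can fit a 7B model onto a 12 GB card.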

I added more GPUs and can now run the model on GPU, but I still don't get a JSON response body.

2024-02-21 05:54:04,403 - ERROR - Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 412, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in call
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in call
await super().call(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in call
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in call
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 83, in call
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in call
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 758, in call
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 778, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 299, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 79, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 299, in app
raise e
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 294, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/home/Chinese-LLaMA-Alpaca-2/scripts/openai_server_demo/openai_api_server.py", line 354, in create_completion
output = predict(
File "/home/Chinese-LLaMA-Alpaca-2/scripts/openai_server_demo/openai_api_server.py", line 206, in predict
generation_output = model.generate(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1789, in generate
return self.beam_sample(
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 3501, in beam_sample
if beam_scorer.is_done or stopping_criteria(input_ids, scores):
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
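As an aside, the log's own hint is the quickest way to localize this kind of error: CUDA reports device-side asserts asynchronously, so CUDA_LAUNCH_BLOCKING must be set before torch initializes CUDA (or the launch command prefixed with CUDA_LAUNCH_BLOCKING=1). A minimal sketch of my own, not part of the repo:

# Force synchronous kernel launches so the stack trace points at the
# kernel that actually failed; must run before torch touches CUDA.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import only after the variable is set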

Try this:

curl http://localhost:19327/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "给我讲一些有关杭州的故事吧"
}'
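Equivalently, as a quick sanity check from Python (a sketch; host and port are the ones used in this thread):

# Same request as the curl call above; prints the JSON body, or the raw
# text if the server returned an error page instead of JSON.
import requests

resp = requests.post(
    "http://localhost:19327/v1/completions",
    json={"prompt": "给我讲一些有关杭州的故事吧"},
    timeout=300,  # generation can be slow
)
print(resp.status_code)
print(resp.json() if resp.ok else resp.text)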


[screenshot]
The error message is the same as before.

Please post your current run command, i.e., the command used to start the server.


[screenshots]

python scripts/openai_server_demo/openai_api_server.py --base_model models/ --gpus 0

This works now.

OK; a single card is indeed enough. On my side it also ran normally with three cards, so the difference may come down to the specific environment.

Why does using multiple GPUs cause the following error?
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
@iMountTai
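For anyone hitting the same assert: device-side asserts inside generate() are often raised by out-of-range token ids or invalid sampling probabilities rather than by the extra GPUs themselves, but one way to rule out placement issues is to shard the model explicitly and keep the inputs on the first shard's device. A hedged sketch, not the repo's server code:

# Hypothetical multi-GPU sketch: shard the model across visible GPUs
# with accelerate's device_map="auto" and place inputs on the first shard.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/")
model = AutoModelForCausalLM.from_pretrained(
    "models/",
    torch_dtype=torch.float16,
    device_map="auto",  # requires accelerate; shards layers across GPUs
)

# model.device is the device of the first shard; inputs must live there.
inputs = tokenizer("给我讲一些有关杭州的故事吧", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))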

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.