xusenlinzy/api-for-open-llm

Docker deployment: embeddings endpoint returns "POST /v1/embeddings HTTP/1.1" 404 Not Found

syusama opened this issue · 2 comments

The following items must be checked before submission

  • Make sure you are using the latest code from the repository (git pull); some issues have already been addressed and fixed.
  • I have read the FAQ section of the project documentation and searched the existing issues/discussions without finding a similar problem or solution.

Type of problem

Model inference and deployment

Operating system

Linux

Detailed description of the problem

The service is started with the following docker command:

docker compose -f .\docker-compose.vllm.qwen2.yml up -d

The service starts successfully and chat works fine, but calling the embeddings endpoint returns an error
(the local embedding model is configured in both docker-compose and .env, and this worked before):

"POST /v1/embeddings HTTP/1.1" 404 Not Found

docker-compose file

version: '3.10'

services:
  vllmapiserver:
    image: llm-api:vllm
    command: python api/server.py
    ulimits:
      stack: 67108864
      memlock: -1
    environment:
      - PORT=8000
      - MODEL_NAME=qwen2
      - MODEL_PATH=checkpoints/Qwen2-7B-Instruct
      - EMBEDDING_NAME=checkpoints/bce-embedding-base_v1
      - TENSOR_PARALLEL_SIZE=2
      - TRUST_REMOTE_CODE=true
      - PROMPT_NAME=qwen2
    volumes:
      - D:\projects\api-for-open-llm\api-for-open-llm:/workspace
      # model path need to be specified if not in pwd
      - D:\projects\Qwen\models:/workspace/checkpoints
    env_file:
      - .env.qwen2.vllm
    ports:
      - "8053:8000"
    restart: always
    networks:
      - vllmapinet
    shm_size: 200g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0','1']  # specify GPUs
              capabilities: [gpu]

networks:
  vllmapinet:
    driver: bridge
    name: vllmapinet

.env file

PORT=8053

# model related
MODEL_NAME=qwen2
MODEL_PATH=D:\projects\Qwen\models\Qwen2-7B-Instruct
PROMPT_NAME=qwen2
EMBEDDING_NAME=D:\projects\Qwen\models\bce-embedding-base_v1
CONTEXT_LEN=2400
DEVICE_MAP=auto
is_qwen_derived_model=false

# api related
API_PREFIX=/v1

# vllm related
ENGINE=vllm
TRUST_REMOTE_CODE=true
TOKENIZE_MODE=slow
TENSOR_PARALLEL_SIZE=2
DTYPE=half

Dependencies

No response

Runtime logs or screenshots

2024-06-08 00:16:48 api-for-open-llm-vllmapiserver-1  | INFO:     Started server process [1]
2024-06-08 00:16:48 api-for-open-llm-vllmapiserver-1  | INFO:     Waiting for application startup.
2024-06-08 00:16:48 api-for-open-llm-vllmapiserver-1  | INFO:     Application startup complete.
2024-06-08 00:16:48 api-for-open-llm-vllmapiserver-1  | INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
2024-06-08 00:17:40 api-for-open-llm-vllmapiserver-1  | 2024-06-07 16:17:40.621 | DEBUG    | api.vllm_routes.chat:create_chat_completion:65 - ==== request ====
2024-06-08 00:17:40 api-for-open-llm-vllmapiserver-1  | {'model': 'Qwen1.5-14B-Chat', 'frequency_penalty': 0.0, 'function_call': None, 'functions': None, 'logit_bias': None, 'logprobs': False, 'max_tokens': 8000, 'n': 1, 'presence_penalty': 0.0, 'response_format': None, 'seed': None, 'stop': ['<|endoftext|>', '<|im_end|>'], 'temperature': 0.01, 'tool_choice': None, 'tools': None, 'top_logprobs': None, 'top_p': 1.0, 'user': None, 'stream': True, 'repetition_penalty': 1.03, 'typical_p': None, 'watermark': False, 'best_of': 1, 'ignore_eos': False, 'use_beam_search': False, 'stop_token_ids': [], 'skip_special_tokens': True, 'spaces_between_special_tokens': True, 'min_p': 0.0, 'include_stop_str_in_output': False, 'length_penalty': 1.0, 'guided_json': None, 'guided_regex': None, 'guided_choice': None, 'guided_grammar': None, 'guided_decoding_backend': None, 'prompt_or_messages': [{'content': '你好', 'role': 'user'}], 'echo': False}
2024-06-08 00:17:41 api-for-open-llm-vllmapiserver-1  | INFO:     172.18.0.1:32992 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2024-06-08 00:29:45 api-for-open-llm-vllmapiserver-1  | INFO:     172.18.0.1:51488 - "POST /v1/embeddings HTTP/1.1" 404 Not Found

To use embeddings, you need to set the environment variable TASKS=llm,rag, where llm starts the large language model and rag starts the embedding, rerank, and other RAG-related models.

That was it! After adding TASKS=llm,rag to the environment variables, the embedding model starts successfully and the endpoint no longer errors. Thank you!
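For anyone hitting the same 404: add the TASKS variable to the .env file (or to the environment section of the compose file). A minimal sketch of the relevant .env lines:

```
# enable both the LLM routes and the RAG routes (embedding, rerank, ...)
TASKS=llm,rag
```

With only the default task list, the /v1/embeddings route is never registered, which is why the server answers 404 even though EMBEDDING_NAME is set.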