A tool for model coverage testing of vLLM.
- Works out of the box with the rocm/vllm-dev Docker image, which has all dependency packages pre-installed.
- Works for both ROCm and CUDA.
- CSV file in: the batch of models to test and the GPUs allocated for each.
- CSV file out: the test results [PASS|FAILED].
- Log file out: each line carries a prefix for easy filtering when debugging.
- Supports single- and multi-GPU tensor parallelism (TP) by setting gpus in the input model_list CSV (see the sketch after this list).
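To make the workflow concrete, here is a minimal Python sketch of the idea the tool implements: read the model list CSV, attempt a short vLLM generation for each model with tensor_parallel_size taken from its gpus value, and record PASS or FAILED. It is illustrative only, not the actual ModelCoverageTest.py code; the prompt and file names are assumptions.

import csv
from vllm import LLM, SamplingParams

def test_model(model_id, gpus):
    # Attempt a short generation; PASS if it completes, FAILED on any error (e.g. OOM).
    try:
        llm = LLM(model=model_id, tensor_parallel_size=gpus)
        llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
        return "PASS"
    except Exception:
        return "FAILED"

with open("model_list.csv") as f_in, open("model_list_results.csv", "w", newline="") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=["model_id", "gpus", "status"])
    writer.writeheader()
    for row in csv.DictReader(f_in):
        row["status"] = test_model(row["model_id"], int(row["gpus"]))
        writer.writerow(row)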
Here we use rocm/vllm-dev:20250112 as an example. Refer to the commands below to start the vLLM container.
docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 32G --hostname=vLLM-CT -v $PWD:/ws -w /ws rocm/vllm-dev:20250112 /bin/bash
Log in with your HF account for model downloading:
huggingface-cli login
Clone this repository and enter it:
git clone https://github.com/alexhegit/vLLM_ModelCoverageTest.git
cd vLLM_ModelCoverageTest
Quick Test
python3 ModelCoverageTest.py --csv model_list.csv
The test finishes with two output files:
- [model_list]_results.csv: the input CSV with a new 'status' column, 'PASS' or 'FAILED', from the compatibility test.
- mct-[yyyymmdd].log: for checking the details of the test, especially errors.
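After a run, a quick way to list just the failed models is to filter the results CSV, for example with a few lines of Python (the file name below assumes an input of model_list.csv):

import csv

with open("model_list_results.csv") as f:
    failed = [row["model_id"] for row in csv.DictReader(f) if row["status"] == "FAILED"]
print("FAILED models:", failed)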
Get help on the tool usage:
python ModelCoverageTest.py --help
usage: ModelCoverageTest.py [-h] [--csv CSV]
Test models with specified prompts from a CSV file.
options:
-h, --help show this help message and exit
--csv CSV   Path to the CSV file containing model_id and gpus
Here is an example CSV file defining the model list, with the model ID (from Hugging Face) and the number of GPUs used for the test. The value in the gpus column is passed as -tp (tensor parallel size) to the vLLM inference engine; a short vLLM API example after the listing shows this mapping.
# cat model_list.csv
model_id,gpus
facebook/opt-125m,1
meta-llama/Llama-3.2-11B-Vision,1
deepseek-ai/DeepSeek-V2,8
meta-llama/Llama-3.2-90B-Vision-Instruct,4
microsoft/Phi-3-vision-128k-instruct,1
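For reference, one row of this CSV maps to a vLLM call roughly like the one below; the DeepSeek-V2 row with gpus=8 becomes tensor_parallel_size=8 (the -tp value). This is only a sketch; some models may need extra options such as trust_remote_code.

from vllm import LLM, SamplingParams

# gpus=8 in the CSV row -> tensor_parallel_size=8 (the -tp value)
llm = LLM(model="deepseek-ai/DeepSeek-V2", tensor_parallel_size=8, trust_remote_code=True)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))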
There are two output files for checking the test results.
- [model_list]_results.csv
# cat model_list_results.csv
model_id,gpus,status
facebook/opt-125m,1,PASS
...
- mct-[yyyymmdd].log: every line printed by vLLM MCT carries a prefix so you can quickly filter the log. Check it to see in detail what happened for any model that FAILED.
- Some models, such as Llama, require you to request access first. Check the log for errors if you do not have access.
- Try multi-GPU tensor parallelism (a larger gpus value) if a model hits OOM on a single GPU.
- The vLLM MCT follows the process [download LLM | inference LLM | delete LLM] to save disk space and avoid failures caused by a continuously growing .cache/huggingface/hub directory, as sketched below.
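The per-model cleanup could look roughly like the sketch below, using the huggingface_hub cache utilities. It only illustrates the cycle described above and is not the tool's actual implementation.

from huggingface_hub import snapshot_download, scan_cache_dir

def free_model_cache(model_id):
    # Remove every cached revision of model_id from ~/.cache/huggingface/hub.
    cache_info = scan_cache_dir()
    for repo in cache_info.repos:
        if repo.repo_id == model_id:
            hashes = [rev.commit_hash for rev in repo.revisions]
            cache_info.delete_revisions(*hashes).execute()

model_id = "facebook/opt-125m"
snapshot_download(model_id)   # 1. download the model
# ... 2. run the vLLM inference test on model_id ...
free_model_cache(model_id)    # 3. delete it to keep the cache directory small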