A tool for model coverage testing of vLLM.
- Works out of the box with the rocm/vllm-dev Docker image, which has all dependency packages pre-installed.
- Works for both ROCm and CUDA.
- CSV file in: the batch of models to test and the GPUs allocated for each.
- CSV file out: the test results [PASS|FAILED].
- Log file out: each line carries a prefix for easy filtering when debugging.
- Supports single- and multi-GPU tensor parallelism (TP) by setting gpus in the input model_list CSV (see the sketch after this list).
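To make the workflow concrete, here is a minimal Python sketch of the idea the tool implements: read the model list CSV, attempt a short vLLM generation for each model with tensor_parallel_size taken from its gpus value, and record PASS or FAILED. It is illustrative only, not the actual ModelCoverageTest.py code; the prompt and file names are assumptions.

import csv
from vllm import LLM, SamplingParams

def test_model(model_id, gpus):
    # Attempt a short generation; PASS if it completes, FAILED on any error (e.g. OOM).
    try:
        llm = LLM(model=model_id, tensor_parallel_size=gpus)
        llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
        return "PASS"
    except Exception:
        return "FAILED"

with open("model_list.csv") as f_in, open("model_list_results.csv", "w", newline="") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=["model_id", "gpus", "status"])
    writer.writeheader()
    for row in csv.DictReader(f_in):
        row["status"] = test_model(row["model_id"], int(row["gpus"]))
        writer.writerow(row)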
Here we use rocm/vllm-dev:20250112 as an example. Refer to the commands below to start the vLLM container.
docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 32G --hostname=vLLM-CT -v $PWD:/ws -w /ws rocm/vllm-dev:20250112 /bin/bash
Log in with your HF account for model downloading:
huggingface-cli login
Clone this repository and enter it:
git clone https://github.com/alexhegit/vLLM_ModelCoverageTest.git
cd vLLM_ModelCoverageTest
Quick Test
python3 ModelCoverageTest.py --csv model_list.csv
The test finishes with two output files:
- [model_list]_results.csv: the input CSV with a new 'status' column, 'PASS' or 'FAILED', from the compatibility test.
- mct-[yyyymmdd].log: for checking the details of the test, especially errors.
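After a run, a quick way to list just the failed models is to filter the results CSV, for example with a few lines of Python (the file name below assumes an input of model_list.csv):

import csv

with open("model_list_results.csv") as f:
    failed = [row["model_id"] for row in csv.DictReader(f) if row["status"] == "FAILED"]
print("FAILED models:", failed)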
Get help on the tool usage:
python ModelCoverageTest.py --help
usage: ModelCoverageTest.py [-h] [--csv CSV]
Test models with specified prompts from a CSV file.
options:
-h, --help show this help message and exit
--csv CSV   Path to the CSV file containing model_id and gpus
Here is an example CSV file defining the model list, with the model ID (from Hugging Face) and the number of GPUs used for the test. The value in the gpus column is passed as -tp (tensor parallel size) to the vLLM inference engine; a short vLLM API example after the listing shows this mapping.
# cat model_list.csv
model_id,gpus
facebook/opt-125m,1
meta-llama/Llama-3.2-11B-Vision,1
deepseek-ai/DeepSeek-V2,8
meta-llama/Llama-3.2-90B-Vision-Instruct,4
microsoft/Phi-3-vision-128k-instruct,1
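For reference, one row of this CSV maps to a vLLM call roughly like the one below; the DeepSeek-V2 row with gpus=8 becomes tensor_parallel_size=8 (the -tp value). This is only a sketch; some models may need extra options such as trust_remote_code.

from vllm import LLM, SamplingParams

# gpus=8 in the CSV row -> tensor_parallel_size=8 (the -tp value)
llm = LLM(model="deepseek-ai/DeepSeek-V2", tensor_parallel_size=8, trust_remote_code=True)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))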
There are two output files for checking the test results.
- [model_list]_results.csv
# cat model_list_results.csv
model_id,gpus,status
facebook/opt-125m,1,PASS
...
- mct-[yyyymmdd].log: every line printed by vLLM MCT carries a prefix so you can quickly filter the log. Check it to see in detail what happened for any model that FAILED.
- Some models, such as Llama, require you to request access first. Check the log for errors if you do not have access.
- Try multi-GPU tensor parallelism (a larger gpus value) if a model hits OOM on a single GPU.
- The vLLM MCT follows the process [download LLM | inference LLM | delete LLM] to save disk space and avoid failures caused by a continuously growing .cache/huggingface/hub directory, as sketched below.
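The per-model cleanup could look roughly like the sketch below, using the huggingface_hub cache utilities. It only illustrates the cycle described above and is not the tool's actual implementation.

from huggingface_hub import snapshot_download, scan_cache_dir

def free_model_cache(model_id):
    # Remove every cached revision of model_id from ~/.cache/huggingface/hub.
    cache_info = scan_cache_dir()
    for repo in cache_info.repos:
        if repo.repo_id == model_id:
            hashes = [rev.commit_hash for rev in repo.revisions]
            cache_info.delete_revisions(*hashes).execute()

model_id = "facebook/opt-125m"
snapshot_download(model_id)   # 1. download the model
# ... 2. run the vLLM inference test on model_id ...
free_model_cache(model_id)    # 3. delete it to keep the cache directory small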