Fork of LLMPerf optimized for open LLM usage.
git clone https://github.com/liveaverage/llmperf.git
pip install -e llmperf/
This fork of LLMPerf was used to generate the following benchmarks:
We implement two tests for evaluating LLMs: a load test to measure performance and a correctness test to verify output correctness.
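The load test is what the examples below run via token_benchmark_ray.py. For reference, the correctness test is invoked through llm_correctness.py; a minimal sketch of running it (flag names follow upstream LLMPerf and may differ in this fork):
python llm_correctness.py \
--model "meta-llama/Llama-2-7b-chat-hf" \
--max-num-completed-requests 150 \
--timeout 600 \
--num-concurrent-requests 10 \
--results-dir "result_outputs"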
Note: Any OpenAI-compatible API works, including vLLM, TGI, or NVIDIA NIM containers.
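For example, vLLM can expose such an OpenAI-compatible endpoint locally; a minimal sketch (model name and port are placeholders):
# Launch a local OpenAI-compatible server with vLLM (sketch)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--port 8000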
export OPENAI_API_KEY=secret_abcdefg
export OPENAI_API_BASE="https://api.endpoints.anyscale.com/v1" # or "http://localhost:8000/v1"
python token_benchmark_ray.py \
--model "meta-llama/Meta-Llama-3-8B-Instruct" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'
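Each run writes raw JSON results into --results-dir; they can be summarized with the parser shipped in this repo:
python parse_results.py --results-dir "result_outputs"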
export HUGGINGFACE_API_KEY="YOUR_HUGGINGFACE_API_KEY" # only needed for Inference Endpoints and the serverless Inference API
# local testing "http://localhost:8000"
# serverless hosted models "https://api-inference.huggingface.co"
# Inference endpoints, e.g. "https://ptrlmejh4tjmcb4t.us-east-1.aws.endpoints.huggingface.cloud"
export HUGGINGFACE_API_BASE="YOUR_HUGGINGFACE_URL"
export MODEL_ID="meta-llama/Llama-2-7b-chat-hf"
python token_benchmark_ray.py \
--model $MODEL_ID \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api huggingface \
--additional-sampling-params '{}'
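Before benchmarking, a single curl request is a quick sanity check; a sketch assuming a TGI-backed Inference Endpoint (for the serverless API, append the model id to the URL, e.g. $HUGGINGFACE_API_BASE/models/$MODEL_ID):
# Send one test generation request to the endpoint (sketch)
curl "$HUGGINGFACE_API_BASE" \
-X POST \
-H "Authorization: Bearer $HUGGINGFACE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{"inputs":"What is 10+10?","parameters":{"max_new_tokens":32}}'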
SageMaker doesn't return the total number of tokens generated by its endpoints, so tokens are counted using the Llama tokenizer.
export MESSAGES_API=true
python llmperf/token_benchmark_ray.py \
--model {endpoint_name} \
--llm-api "sagemaker" \
--max-num-completed-requests 500 \
--timeout 600 \
--num-concurrent-requests 25 \
--results-dir "results"
If the model name differs from the SageMaker endpoint name, you can override the model name sent to the endpoint via an environment variable:
export MESSAGES_API=true
export AWS_SAGEMAKER_EP_MODEL_NAME="meta/llama3-8b-instruct"
python llmperf/token_benchmark_ray.py \
--model {endpoint_name} \
--llm-api "sagemaker" \
--max-num-completed-requests 500 \
--timeout 600 \
--num-concurrent-requests 25 \
--results-dir "results"
NOTE: WIP, not yet tested.
Here, --model is used for logging, not for selecting the model. The model is specified in the Vertex AI Endpoint ID.
The GCLOUD_ACCESS_TOKEN needs to be refreshed fairly regularly, as the token generated by gcloud auth print-access-token expires after 15 minutes or so.
Vertex AI doesn't return the total number of tokens generated by its endpoints, so tokens are counted using the Llama tokenizer.
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID
export GCLOUD_ACCESS_TOKEN=$(gcloud auth print-access-token)
export GCLOUD_PROJECT_ID=YOUR_PROJECT_ID
export GCLOUD_REGION=YOUR_REGION
export VERTEXAI_ENDPOINT_ID=YOUR_ENDPOINT_ID
python token_benchmark_ray.py \
--model "meta-llama/Meta-Llama-3-8B-Instruct" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api "vertexai" \
--additional-sampling-params '{}'
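Since the access token expires quickly, refresh it between repeated runs; a sketch reusing the flags above:
# Re-export a fresh token before every benchmark run (sketch)
for i in 1 2 3; do
export GCLOUD_ACCESS_TOKEN=$(gcloud auth print-access-token)
python token_benchmark_ray.py \
--model "meta-llama/Meta-Llama-3-8B-Instruct" \
--max-num-completed-requests 2 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api "vertexai"
done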
See python token_benchmark_ray.py --help for more details on the arguments.
First we need to start TGI:
model=meta-llama/Meta-Llama-3-8B-Instruct
token=$(cat ~/.cache/huggingface/token)
num_shard=1
max_input_length=5000
max_total_tokens=6000
max_batch_prefill_tokens=6144
docker run --gpus $num_shard -ti -p 8080:80 \
-e MODEL_ID=$model \
-e HF_TOKEN=$token \
-e NUM_SHARD=$num_shard \
-e MAX_INPUT_LENGTH=$max_input_length \
-e MAX_TOTAL_TOKENS=$max_total_tokens \
-e MAX_BATCH_PREFILL_TOKENS=$max_batch_prefill_tokens \
ghcr.io/huggingface/text-generation-inference:2.0.3
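Model download and warm-up can take a few minutes; one way to wait for readiness is to poll TGI's /health route, which returns 200 once the model is loaded:
# Block until the TGI container reports healthy
until curl -sf http://localhost:8080/health > /dev/null; do sleep 5; done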
Test that TGI is serving requests:
curl http://localhost:8080 \
-X POST \
-d '{"inputs":"What is 10+10?","parameters":{"temperature":0.2, "top_p": 0.95, "max_new_tokens": 256}}' \
-H 'Content-Type: application/json'
Then we can run the benchmark:
export HUGGINGFACE_API_BASE="http://localhost:8080"
export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
python token_benchmark_ray.py \
--model $MODEL_ID \
--max-num-completed-requests 100 \
--num-concurrent-requests 10 \
--results-dir "result_outputs" \
--llm-api huggingface
Parse results
python parse_results.py --results-dir "result_outputs"
Results on a 1x A10G GPU:
Avg. Input token length: 550
Avg. Output token length: 150
Avg. First-Time-To-Token: 375.99ms
Avg. Throughput: 163.23 tokens/sec
Avg. Latency: 38.22ms/token
Results on a 1x H100 GPU (with max_batch_prefill_tokens=16182):
Note: WIP
In this fork we added support for using datasets from Hugging Face to generate the input for the LLM. The dataset should either have a prompt column or use the messages format from OpenAI, in which case the first user message will be used as input.
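For illustration, a dataset row could therefore look like either of the following (hypothetical examples following the column names above):
{"prompt": "Summarize the rules of chess in one paragraph."}
{"messages": [{"role": "user", "content": "Summarize the rules of chess in one paragraph."}]}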
Note: WIP.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
],
"stream": true
}'