LLMPerf

Fork of LLMPerf optimized for open LLM usage.

Installation

git clone https://github.com/liveaverage/llmperf.git 
pip install -e llmperf/
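
To confirm the editable install worked, a quick import check can help; this assumes the package installs under the name llmperf, as in upstream LLMPerf.

# Quick check that the editable install is importable; assumes the package
# name is llmperf, as in upstream LLMPerf.
import llmperf

print(llmperf.__file__)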

Benchmarks

This fork of LLMPerf was used to generate the following benchmarks:

Basic Usage

We implement two tests for evaluating LLMs: a load test to measure performance and a correctness test to check that the model produces correct output.

OpenAI Compatible APIs

Note: This includes vLLM, TGI, and NVIDIA NIM containers. A quick endpoint sanity check is sketched after the benchmark command below.

export OPENAI_API_KEY=secret_abcdefg
export OPENAI_API_BASE="https://api.endpoints.anyscale.com/v1" # or "http://localhost:8000/v1"

python token_benchmark_ray.py \
--model "meta-llama/Meta-Llama-3-8B-Instruct" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'
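
Before launching a longer run, it can be worth confirming the endpoint answers at all. A minimal sketch using the openai Python client (not a dependency of this fork); it reuses the env vars exported above and the same model name as the benchmark command.

import os

from openai import OpenAI  # pip install openai

# Hedged sanity check: confirm the OpenAI-compatible endpoint answers before
# starting a longer benchmark. Uses the same env vars exported above.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ["OPENAI_API_BASE"],
)
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What is 10+10?"}],
    max_tokens=32,
)
print(response.choices[0].message.content)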

Hugging Face (TGI)

export HUGGINGFACE_API_KEY="YOUR_HUGGINGFACE_API_KEY" # only needed for Inference Endpoints and the serverless API
# local testing "http://localhost:8000"
# serverless hosted models "https://api-inference.huggingface.co"
# Inference endpoints, e.g. "https://ptrlmejh4tjmcb4t.us-east-1.aws.endpoints.huggingface.cloud"
export HUGGINGFACE_API_BASE="YOUR_HUGGINGFACE_URL"
export MODEL_ID="meta-llama/Llama-2-7b-chat-hf"

python token_benchmark_ray.py \
--model $MODEL_ID \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api huggingface \
--additional-sampling-params '{}'

SageMaker (TGI)

SageMaker doesn't return the total number of tokens generated by its endpoints, so tokens are counted using the Llama tokenizer.
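
For reference, client-side token counting can look roughly like the sketch below; the tokenizer repo used here is a small, public Llama tokenizer and may not be the exact one this fork loads.

from transformers import AutoTokenizer  # pip install transformers

# Illustration of client-side token counting when the endpoint does not
# report usage; the tokenizer below is an assumption, not necessarily the
# exact one this fork uses.
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
generated_text = "10 + 10 equals 20."
num_output_tokens = len(tokenizer.encode(generated_text, add_special_tokens=False))
print(num_output_tokens)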

export MESSAGES_API=true
python llmperf/token_benchmark_ray.py \
--model {endpoint_name} \
--llm-api "sagemaker" \
--max-num-completed-requests 500 \
--timeout 600 \
--num-concurrent-requests 25 \
--results-dir "results"
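
Before benchmarking, it can also help to confirm the endpoint is actually live; a minimal pre-flight sketch using boto3 (not part of this repo; the endpoint name is a placeholder).

import boto3  # pip install boto3

# Hypothetical pre-flight check: verify the SageMaker endpoint is InService
# before pointing the benchmark at it. Replace the placeholder endpoint name.
sagemaker = boto3.client("sagemaker")
status = sagemaker.describe_endpoint(EndpointName="YOUR_ENDPOINT_NAME")["EndpointStatus"]
print(status)  # expect "InService"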

If the model name and the endpoint name differ, you can set the model name via an environment variable while passing the endpoint name to --model:

export MESSAGES_API=true
export AWS_SAGEMAKER_EP_MODEL_NAME="meta/llama3-8b-instruct"
python llmperf/token_benchmark_ray.py \
--model {endpoint_name} \
--llm-api "sagemaker" \
--max-num-completed-requests 500 \
--timeout 600 \
--num-concurrent-requests 25 \
--results-dir "results"

Vertex AI

NOTE: WIP, not yet tested.

Here, --model is used for logging, not for selecting the model. The model is specified in the Vertex AI Endpoint ID.

The GCLOUD_ACCESS_TOKEN needs to be refreshed regularly, as the token generated by gcloud auth print-access-token expires after roughly 15 minutes; a small refresh helper is sketched below the exports.

Vertex AI doesn't return the total number of tokens generated by its endpoints, so tokens are counted using the Llama tokenizer.

gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

export GCLOUD_ACCESS_TOKEN=$(gcloud auth print-access-token)
export GCLOUD_PROJECT_ID=YOUR_PROJECT_ID
export GCLOUD_REGION=YOUR_REGION
export VERTEXAI_ENDPOINT_ID=YOUR_ENDPOINT_ID
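
Because the access token is short-lived, a small helper can refresh it before each run; this is a hypothetical convenience, not part of the repo.

import os
import subprocess

# Hypothetical helper, not part of this repo: re-run gcloud and export a fresh
# access token so long benchmark sessions don't fail with an expired token.
def refresh_gcloud_token() -> str:
    token = subprocess.run(
        ["gcloud", "auth", "print-access-token"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    os.environ["GCLOUD_ACCESS_TOKEN"] = token
    return token

refresh_gcloud_token()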

python token_benchmark_ray.py \
--model "meta-llama/Meta-Llama-3-8B-Instruct" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api "vertexai" \
--additional-sampling-params '{}'

See python token_benchmark_ray.py --help for more details on the arguments.

Examples and other use cases

End-to-End Test for Llama 3 8B Instruct

First we need to start TGI:

model=meta-llama/Meta-Llama-3-8B-Instruct
token=$(cat ~/.cache/huggingface/token)
num_shard=1
max_input_length=5000
max_total_tokens=6000
max_batch_prefill_tokens=6144
docker run --gpus $num_shard -ti -p 8080:80 \
  -e MODEL_ID=$model \
  -e HF_TOKEN=$token \
  -e NUM_SHARD=$num_shard \
  -e MAX_INPUT_LENGTH=$max_input_length \
  -e MAX_TOTAL_TOKENS=$max_total_tokens \
  -e MAX_BATCH_PREFILL_TOKENS=$max_batch_prefill_tokens \
  ghcr.io/huggingface/text-generation-inference:2.0.3

Test the TGI endpoint:

curl http://localhost:8080 \
    -X POST \
    -d '{"inputs":"What is 10+10?","parameters":{"temperature":0.2, "top_p": 0.95, "max_new_tokens": 256}}' \
    -H 'Content-Type: application/json'

Then we can run the benchmark:

export HUGGINGFACE_API_BASE="http://localhost:8080"
export MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
python token_benchmark_ray.py \
--model $MODEL_ID \
--max-num-completed-requests 100 \
--num-concurrent-requests 10 \
--results-dir "result_outputs" \
--llm-api huggingface 

Parse results

python parse_results.py --results-dir "result_outputs"
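
If you prefer to inspect the raw output directly, the results directory can be read without parse_results.py; a minimal sketch, assuming it contains *_summary.json files as upstream LLMPerf writes them.

import glob
import json

# Minimal sketch: print every metric found in the summary files the benchmark
# writes into the results directory (file naming follows upstream LLMPerf).
for path in glob.glob("result_outputs/*_summary.json"):
    print(path)
    with open(path) as f:
        summary = json.load(f)
    for key, value in sorted(summary.items()):
        print(f"  {key}: {value}")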

Results on a 1x A10G GPU:

Avg. Input token length: 550
Avg. Output token length: 150
Avg. Time To First Token: 375.99ms
Avg. Throughput: 163.23 tokens/sec
Avg. Latency: 38.22ms/token

Results on a 1x H100 GPU (with max_batch_prefill_tokens=16182):

Speculative Decoding

Note: WIP

Use Hugging Face Dataset

In this fork we added support for using Hugging Face datasets to generate the input for the LLM. The dataset should either have a prompt column or use the OpenAI messages format, in which case the first user message is used as the input (see the sketch at the end of this section).

Note: WIP.

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk" \
    -d '{ "model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "Hello!" } ], "stream": true }'