llm-d-bench

Automated llm-d inference benchmarking on OpenShift using GuideLLM, with MLflow tracking and GitHub Actions integration.

This may work with other LLM endpoints, but it has only been tested against llm-d endpoints.

Quick Setup

This project uses the following:

  • Reflector - Secret and ConfigMap mirroring across namespaces
  • Kueue - Job queueing and batching
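
To confirm both are installed on the cluster, a check along these lines should work (a sketch; deployment names and namespaces vary by install method):

# List any Kueue or Reflector deployments across all namespaces
oc get deployments -A | grep -Ei 'kueue|reflector'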

1. Deploy Infrastructure

Note

AWS IAM policy setup is handled by the user; see mlflow/AWS_IAM_POLICY.md for details.

# Copy and configure environment
cp .env.example .env
# Edit .env with your credentials

# Deploy MLflow, PostgreSQL, and GitHub runners
./bootstrap.sh

# Dry run
./bootstrap.sh --dry-run

This deploys:

  • MLflow - Experiment tracking with PostgreSQL backend and S3 storage
  • Self-hosted GitHub runners - Run benchmarks via PR comments
  • Custom benchmark image - Built and pushed to OpenShift registry
  • Required add-ons/operators - Kueue and Reflector
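
Once bootstrap finishes, the MLflow stack can be sanity-checked (route name and namespace per this repo's defaults, as used in the Results section below):

# Verify the MLflow pods are running and the route is exposed
oc get pods -n mlflow
oc get route mlflow -n mlflow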

2. Run Benchmarks

Via GitHub Actions (recommended):

# Comment on any PR:
/benchmark qwen-0.6b-baseline

# With parameter overrides:
/benchmark qwen-0.6b-baseline
benchmark.maxSeconds=600
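
Multiple overrides appear to follow the same one-per-line pattern. A hypothetical comment combining two of the parameters documented under Configuration below:

# Comment on any PR (hypothetical multi-override example):
/benchmark qwen-0.6b-baseline
benchmark.maxSeconds=900
benchmark.rate={1,50,100}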

Warning

This repo does not deploy llm-d itself. Before triggering a benchmark, verify that the model named in the experiment config is actually being served on the target endpoint, or the run will fail.

Via Helm:

helm install <your_deployment_name> ./llm-d-bench \
  -f llm-d-bench/experiments/qwen-0.6b-baseline.yaml \
  -n <your_namespace>
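
Individual parameters can also be overridden on the Helm command line instead of editing the experiment file. A sketch using the parameters documented under Configuration below (<your_deployment_name> and <your_namespace> are placeholders as above):

# Hypothetical: shorten the run and disable MLflow tracking for a one-off test
helm install <your_deployment_name> ./llm-d-bench \
  -f llm-d-bench/experiments/qwen-0.6b-baseline.yaml \
  --set benchmark.maxSeconds=300 \
  --set mlflow.enabled=false \
  -n <your_namespace>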

Adding Benchmarks

See llm-d-bench/ADDING_BENCHMARKS.md for adding new benchmark tools.

Quick summary:

  1. Add benchmark implementation to llm-d-bench/templates/benchmarks/<tool-name>/
  2. Create experiment config in llm-d-bench/experiments/
  3. Trigger via /benchmark <experiment-name> in PR comments


Note

Experiment names cannot contain a dot (.) for security reasons.
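
As a rough illustration, creating a new experiment file might look like the following sketch. The field names here are inferred from the parameters listed under Configuration below; see ADDING_BENCHMARKS.md for the actual schema, and note the dot-free experiment name:

# Hypothetical experiment config; keys mirror llm-d-bench/values.yaml
cat > llm-d-bench/experiments/my-experiment.yaml <<'EOF'
benchmark:
  target: http://my-endpoint.example:8000   # inference endpoint (placeholder URL)
  model: my-model                           # must match the model actually served
  rate: "{1,50,100}"                        # concurrent request rates
  maxSeconds: 600                           # max runtime
mlflow:
  enabled: true                             # track results in MLflow
kueue:
  enabled: true                             # queue the job through Kueue
EOF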

GitHub Action Workflow

The benchmark workflow (.github/workflows/benchmark.yaml) triggers on PR comments.

How it works:

  1. User comments /benchmark <experiment> on a PR
  2. Self-hosted runner picks up the job
  3. Checks out PR branch
  4. Runs Helm install with experiment config
  5. Waits for job completion (up to 12 hours)
  6. Reacts with 🚀 on success or 😕 on failure

Requirements:

  • Self-hosted runner with label openshift
  • GitHub environment named benchmark
  • OpenShift secrets: OPENSHIFT_SERVER_URL, OPENSHIFT_CA_CERT, OPENSHIFT_TOKEN
  • Only the repository owner can trigger benchmarks
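
Registering the secrets can be scripted with the GitHub CLI. A sketch, assuming you run it from the repository checkout; the server URL and CA cert path are placeholders:

# Register the required OpenShift secrets in the 'benchmark' environment
gh secret set OPENSHIFT_SERVER_URL --env benchmark --body "https://api.cluster.example.com:6443"
gh secret set OPENSHIFT_CA_CERT --env benchmark < /path/to/ca.crt
gh secret set OPENSHIFT_TOKEN --env benchmark --body "$(oc whoami -t)"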

Configuration

Environment Variables (.env)

MLflow:

POSTGRES_PASSWORD=your-password
AWS_ACCESS_KEY_ID=your-key
AWS_SECRET_ACCESS_KEY=your-secret
S3_BUCKET_NAME=your-bucket
AWS_REGION=us-east-1
MLFLOW_ADMIN_PASSWORD=your-password

GitHub Runners:

GITHUB_TOKEN=ghp_your_token
GITHUB_OWNER=your-org-or-username
GITHUB_REPOSITORY=                    # Empty for org-wide runners
RUNNER_LABELS=openshift,self-hosted
RUNNER_REPLICAS=2

Benchmark Parameters

Key parameters in llm-d-bench/values.yaml:

  • benchmark.target - Target inference endpoint
  • benchmark.model - Model name
  • benchmark.rate - Concurrent request rates (e.g., {1,50,100})
  • benchmark.data - Number of requests or token specs
  • benchmark.maxSeconds - Max runtime (default: 600s)
  • mlflow.enabled - Enable MLflow tracking
  • kueue.enabled - Enable Kueue queues
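
These values are presumably passed through to GuideLLM. As a rough illustration, they map onto a GuideLLM invocation along these lines (flag names should be verified against your GuideLLM version; the endpoint, model, and token spec are placeholders):

# Hypothetical GuideLLM call corresponding to the values above
guidellm benchmark \
  --target "http://my-endpoint.example:8000" \
  --model "my-model" \
  --rate 1,50,100 \
  --max-seconds 600 \
  --data "prompt_tokens=256,output_tokens=128"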

Results

  • MLflow - Experiments are tracked when mlflow.enabled=true

Access MLflow UI:

oc get route mlflow -n mlflow -o jsonpath='{.spec.host}'
# Login with credentials from .env
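
For convenience, the route lookup can be combined with opening the UI (a sketch; open is macOS-specific, use xdg-open on Linux, and https assumes TLS is enabled on the route):

# Fetch the MLflow route and open it in a browser
MLFLOW_HOST=$(oc get route mlflow -n mlflow -o jsonpath='{.spec.host}')
open "https://${MLFLOW_HOST}"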