H2O Large Language Model (LLM) Evaluation

As Large Language Models (LLMs) gain traction across diverse applications, comprehensive evaluation and comparison of these models has become critical. This repository contributes to that effort by providing an evaluation method and a toolkit for assessing LLMs.

Please read the Blog Post for more context.

EvalGPT.ai
Docker Compose Setup
Local Setup
Reproducing Leaderboard Results
Roadmap

EvalGPT.ai

evalgpt.ai hosts a leaderboard of some of the top LLMs, ranked by their Elo scores. The leaderboard is updated frequently and aims to provide a comprehensive and fair assessment of Large Language Models. The website's features are described below.

Elo Leaderboard

The Elo Leaderboard provides a ranking of the top LLMs based on their Elo scores. The Elo scores are computed from the results of A/B tests, wherein the LLMs are pitted against each other in a series of games. The ranking system employed is based on the Elo Rating System. The procedure for Elo score computation closely follows the methodology outlined at this resource.
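
To make the scoring concrete, here is a minimal sketch of a standard sequential Elo update over A/B test outcomes. It is not taken from this repository: the function names, the K-factor of 32, the initial rating of 1000, and the 400-point scale are illustrative assumptions, and the leaderboard notebooks may use different values or an averaged/bootstrapped variant.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def compute_elo(games, k=32.0, initial=1000.0):
    """Apply Elo updates in order.

    games: iterable of (model_a, model_b, winner), where winner is
    "a", "b", or "tie". k and initial are assumed values, not the
    leaderboard's actual settings.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, winner in games:
        ra, rb = ratings[model_a], ratings[model_b]
        ea = expected_score(ra, rb)
        # Actual score for A: 1 for a win, 0 for a loss, 0.5 for a tie.
        sa = {"a": 1.0, "b": 0.0}.get(winner, 0.5)
        ratings[model_a] = ra + k * (sa - ea)
        ratings[model_b] = rb + k * ((1.0 - sa) - (1.0 - ea))
    return dict(ratings)


if __name__ == "__main__":
    games = [("model-x", "model-y", "a"), ("model-y", "model-z", "tie")]
    for name, rating in sorted(compute_elo(games).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Because each update adds to one model exactly what it subtracts from the other, the sum of all ratings stays constant. Sequential Elo depends on game order, which is one reason implementations often average over shuffled orderings.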

Prompts

The Prompts tab lists the 60 prompts used to evaluate the LLMs. The prompts are grouped into categories by the type of task they are designed to test.

Responses

In the Responses section, you can see the responses generated by the LLMs for the prompts. You can also select the LLMs and prompts to compare the responses.

Click on the "Select Models" button to select the LLMs to compare. You can also select a different prompt using the "Previous" and "Next" buttons.

For any two selected models and a prompt, you can view GPT-4's evaluation by clicking the "Show GPT Eval" button at the top right.

A/B Tests

"Which is Better: A or B?" provides the interface to perform human evaluation of the LLMs. Each A/B test consists of a prompt and two responses generated by two different LLMs. The user is asked to select the better response among the two.

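As a rough illustration of what a single A/B test carries, the sketch below models one test as a small record and maps a user's choice back to the winning model. The class and field names are hypothetical and do not reflect this repository's database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ABTest:
    """One A/B test: a prompt plus two responses from two different LLMs.

    Hypothetical structure for illustration only.
    """
    prompt: str
    model_a: str
    response_a: str
    model_b: str
    response_b: str


def winner_of(test: ABTest, choice: str) -> str:
    """Map the user's choice ("A" or "B") to the winning model's name."""
    if choice == "A":
        return test.model_a
    if choice == "B":
        return test.model_b
    raise ValueError(f"choice must be 'A' or 'B', got {choice!r}")


if __name__ == "__main__":
    test = ABTest(
        prompt="Explain recursion in one sentence.",
        model_a="model-x",
        response_a="Recursion is when a function calls itself.",
        model_b="model-y",
        response_b="See: recursion.",
    )
    print(winner_of(test, "A"))
```
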
Docker Compose Setup

1. Clone the repository

git clone https://github.com/h2oai/h2o-LLM-eval.git
cd h2o-LLM-eval

2. Run Docker Compose

docker compose up -d

Navigate to http://localhost:10101/ in your browser

Local Setup

1. Clone the repository

git clone https://github.com/h2oai/h2o-LLM-eval.git
cd h2o-LLM-eval

2. Setup Database

a. Create a docker volume for the database

docker volume create llm-eval-db-data

b. Start PostgreSQL 14 in docker

docker run -d --name=llm-eval-db -p 5432:5432 -v llm-eval-db-data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=pgpassword postgres:14

c. Install PostgreSQL client

On Ubuntu:

sudo apt update
sudo apt install postgresql-client

On macOS:

brew install libpq
echo 'export PATH="$(brew --prefix libpq)/bin:$PATH"' >> ~/.zshrc

d. Load the latest data dump into the database

PGPASSWORD=pgpassword psql --host=localhost --port=5432 --username=postgres < data/10_init.sql

3. Setup the environment

The setup has been tested with Python 3.10.

python -m venv .venv

. .venv/bin/activate

pip install --upgrade pip
pip install -r requirements.txt

4. Run the App

POSTGRES_HOST=localhost POSTGRES_USER=maker POSTGRES_PASSWORD=makerpassword POSTGRES_DB=llm_eval_db H2O_WAVE_NO_LOG=true wave run llm_eval/app.py

Navigate to http://localhost:10101/ in your browser
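
If the app fails to reach the database, it can help to check the connection settings in isolation. The sketch below assembles a PostgreSQL connection URL from the same environment variables passed to the run command above. The helper name `postgres_dsn` and the `POSTGRES_PORT` variable are assumptions made for illustration; this is not necessarily how the app itself reads its configuration.

```python
import os
from typing import Mapping
from urllib.parse import quote


def postgres_dsn(env: Mapping[str, str] = os.environ) -> str:
    """Build a postgresql:// URL from POSTGRES_* variables.

    POSTGRES_PORT is a hypothetical, optional variable that defaults
    to 5432, the port published by the docker run command above.
    """
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    user = quote(env["POSTGRES_USER"], safe="")
    # Percent-encode the password so characters like '@' or '/' are safe.
    password = quote(env["POSTGRES_PASSWORD"], safe="")
    database = env["POSTGRES_DB"]
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


if __name__ == "__main__":
    example = {
        "POSTGRES_HOST": "localhost",
        "POSTGRES_USER": "maker",
        "POSTGRES_PASSWORD": "makerpassword",
        "POSTGRES_DB": "llm_eval_db",
    }
    print(postgres_dsn(example))
```

The resulting URL can be passed to psql as its connection argument to confirm that the credentials and database name are accepted.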

Reproducing Leaderboard Results

We provide notebooks to generate leaderboard results and reproduce evalgpt.ai.

Run run_all_evaluations.ipynb to evaluate every A/B test that the chosen evaluation model has not yet evaluated, and to insert the outcomes into the database. An A/B test counts as unevaluated by a given model if no evaluation by that model exists for the given combination of models and prompt. After you add a new model, running this notebook therefore evaluates all A/B tests for that model against all other models.
Run all cells in calculate_elo_rating_public_leaderboard.ipynb to get the Elo leaderboard and relevant charts given the evaluations in the database.
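
The rule for what counts as an unevaluated A/B test can be sketched as a set difference over all model pairs and prompts. This is an illustrative sketch, not the notebook's code: the function name, the in-memory `evaluated` set, and the assumption that the order of the two models within a pair does not matter are all hypothetical.

```python
from itertools import combinations


def unevaluated_tests(models, prompts, evaluated):
    """List (model_a, model_b, prompt) triples still lacking an evaluation.

    evaluated: set of (frozenset({model_a, model_b}), prompt) entries
    already judged by the chosen evaluation model. Using a frozenset
    encodes the assumption that pair order is irrelevant.
    """
    pending = []
    for model_a, model_b in combinations(sorted(models), 2):
        for prompt in prompts:
            if (frozenset((model_a, model_b)), prompt) not in evaluated:
                pending.append((model_a, model_b, prompt))
    return pending


if __name__ == "__main__":
    models = ["model-x", "model-y"]
    prompts = ["p1", "p2"]
    done = {(frozenset(("model-x", "model-y")), p) for p in prompts}
    # Adding a new model leaves only pairs involving it unevaluated.
    for item in unevaluated_tests(models + ["model-new"], prompts, done):
        print(item)
```

With this rule, adding one model to a fully evaluated set of N models and P prompts yields N × P pending tests, all of which involve the new model.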

Roadmap

Models

Add FreeWilly2 to the Leaderboard

Application

v2 architecture
Option for users to submit new models

Eval

More prompts in each category
Document Q/A and Retrieval Category with ground truth
Document Summarization Category