Public LLM leaderboards like Hugging Face's are great for getting a general idea of which models perform well. However, they are less useful when you need to evaluate models for a specific generative task. The LMSYS Chatbot Arena does provide interesting results, but it is also too general.
LaPET stands for Language Pairwise Evaluation Toolkit. It is targeted at users who need to know how well a model will work for a specific task, such as summarizing a customer service call, putting together an action plan to resolve a customer issue, or analyzing a spreadsheet for inconsistencies. These real-world tasks require an evaluation method that is easy to use for any kind of user, whether you want to create your own LLM benchmark or use data from ours.
The purpose of this library is to make it easier to evaluate the quality of LLM outputs from multiple models across a set of user-selectable tasks. Outputs are evaluated using an LLM as a judge (GPT-4o).
LaPET performs a pairwise preference evaluation for every possible pair of LLM outputs. Users define a set of prompts for generation, the number of samples they would like to use, and which (supported) models they would like to evaluate. We randomize the order of the model outputs (first or second) to reduce the chance of positional bias, and we try to eliminate any extra language that might skew preference based on output length. Both the LLM outputs and the judge's evaluations are stored in CSV files for further analysis.
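The pairwise setup described above can be sketched roughly as follows. This is a minimal illustration; the function name and data layout are assumptions, not LaPET's actual API:

```python
# Sketch of pairwise comparison with randomized output order.
# Names and structure are illustrative, not LaPET's internal API.
import itertools
import random

def make_pairwise_comparisons(outputs):
    """outputs: dict mapping model name -> generated text for one prompt."""
    comparisons = []
    for model_a, model_b in itertools.combinations(sorted(outputs), 2):
        pair = [(model_a, outputs[model_a]), (model_b, outputs[model_b])]
        random.shuffle(pair)  # randomize first/second position to reduce positional bias
        comparisons.append(pair)
    return comparisons

outputs = {"llama3_8b_instruct": "...", "phi_3": "...", "zephyr_7b_beta": "..."}
pairs = make_pairwise_comparisons(outputs)
print(len(pairs))  # 3 models -> 3 unique pairs
```

With n models this yields n·(n−1)/2 comparisons per prompt, each judged once with a randomized A/B order.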
The current version of LaPET requires access to GPUs on a server, or you can use this Google Colab notebook, which works if you have a Google Colab Pro+ account. You will also need a Hugging Face account to download models and an OpenAI account to use the LLM judge.
We plan to add more models over time and on request. The library currently supports outputs generated by the following models:
- llama2_7b_chat
- llama3_8b_instruct
- phi_3
- zephyr_7b_beta
- gemma_7b
We use GPT-4o as the LLM evaluator (judge), which picks a winner between each pair of LLM-generated outputs.
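To make the judging step concrete, a pairwise judge prompt might be assembled along these lines. The exact wording LaPET sends to GPT-4o is an assumption; only the A/B comparison structure is taken from the description above:

```python
# Hypothetical judge-prompt builder; the real prompt text used by LaPET may differ.
def build_judge_prompt(task, response_a, response_b):
    return (
        f"Task: {task}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response completes the task better? Answer with 'A' or 'B'."
    )

prompt = build_judge_prompt(
    "Summarize this customer service call.",
    "Summary from model 1...",
    "Summary from model 2...",
)
# The resulting prompt would then be sent to GPT-4o via the OpenAI API,
# and the returned 'A' or 'B' recorded as the winner for that pair.
```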
You will need an A100 or H100 with at least 40GB of GPU memory to run LaPET locally. Alternatively, you can use the Google Colab notebook if you have a Google Colab Pro+ account (select the A100).
- Edit generate.py as needed. You can change which models you want to evaluate and adjust global model parameters like temperature and max_length. You can also change the prompts to suit the tasks you want to evaluate and set how many output samples you would like to generate.
- Run generate.py (you will need your Hugging Face User Access Token and a local GPU with 40GB of memory; we have tested NVIDIA A100s and H100s).
- This generates a set of responses for each model and prompt and stores them in eval_data.csv. These are the model outputs that the LLM judge (GPT-4o) will evaluate.
- Run evaluate.py. You will need your OpenAI environment variables set to run this script: OPENAI_ORG, OPENAI_PROJECT, and OPENAI_KEY.
- This writes the result of each pairwise evaluation to eval_results.csv.
- We provide a Jupyter notebook, Evaluation_Results.ipynb, that creates a preference graph for each model.
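Beyond the provided notebook, the pairwise results in eval_results.csv can be tallied with a few lines of standard-library Python. This is a sketch under assumed column names (`model_a`, `model_b`, `winner`); check the actual CSV header LaPET writes before adapting it:

```python
# Sketch: tally pairwise wins per model from the results CSV.
# Column names (model_a, model_b, winner) are assumptions; inspect
# eval_results.csv for the actual schema.
import csv
import io
from collections import Counter

# Stand-in data; in practice use open("eval_results.csv") instead of io.StringIO.
csv_text = """model_a,model_b,winner
llama3_8b_instruct,phi_3,llama3_8b_instruct
llama3_8b_instruct,zephyr_7b_beta,llama3_8b_instruct
phi_3,zephyr_7b_beta,zephyr_7b_beta
"""
rows = list(csv.DictReader(io.StringIO(csv_text)))
win_counts = Counter(row["winner"] for row in rows)
print(win_counts.most_common())
```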
- We do not (yet) evaluate for accuracy.
- The prompts are global until we support model-level prompts. This might affect output quality, since each LLM is more or less sensitive to different prompting strategies.
- We randomly select conversations from a large synthetic dataset, which causes the results to vary from one run to another.
- LaPET is only configured to work with one kind of dataset (conversations) until we add support for others.
- Create task templates for testing different kinds of task groups beyond conversation tasks
- Add the ability to select more than one judge, including a human evaluator
- Add the ability to use custom prompts for each model
- Make the default prompts more robust across all models
- Create a smaller refined dataset
- Add flash attention for Phi-3
- Create a Gemma subclass to handle lack of chat template in tokenizer
- Prompt optimizer to automatically recommend and test different prompt strategies for a task / model
- Add performance metrics (memory, tokens/second)
- Add option to utilize cloud LLM endpoints like Groq, Cloudflare, etc.
- Add commercial models like Cohere, Anthropic, Google, Mistral
- Support for using multiple GPUs