Step 1. Copy `.env.template` to `.env` and enter your keys, notably `OPENAI_API_KEY`.
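For example (`sk-...` below is a placeholder; the exact keys depend on what is listed in the template):

```
cp .env.template .env
# then edit .env and fill in your keys, e.g.:
# OPENAI_API_KEY=sk-...
```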
Step 2. Install Python requirements:

```
python -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
```
Step 3. Run the evaluation:

```
python evaluate_dataset.py <optional arguments>
```

By default, this runs an evaluation with gpt-3.5-turbo on all 100 samples in the Misconceptions category of TruthfulQA.

- To specify a model, supply `--model <name>`, using the LiteLLM naming.
- For a shorter test run, supply `--num-samples <count>`.
- To see the generated answer to every question, add `--verbose`.
Example:

```
python evaluate_dataset.py --model gpt-4-1106-preview --num-samples 3 --verbose
```
More options: `python evaluate_dataset.py --help`.
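To illustrate the LiteLLM naming: OpenAI model names are passed through as-is, while other providers are addressed by their LiteLLM model name. The names below are examples, not a guaranteed list; check LiteLLM's documentation for your provider:

```
python evaluate_dataset.py --model gpt-4-0314    # OpenAI
python evaluate_dataset.py --model claude-2      # Anthropic (presumably needs ANTHROPIC_API_KEY in .env)
```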
The following commands evaluate the crafted and generated datasets with each model:

```
python evaluate_dataset.py --dataset-file 'datasets/crafted_dataset_unfiltered.jsonl' --model davinci-002
python evaluate_dataset.py --dataset-file 'datasets/crafted_dataset_unfiltered.jsonl' --model gpt-3.5-turbo
python evaluate_dataset.py --dataset-file 'datasets/crafted_dataset_unfiltered.jsonl' --model gpt-4-0314
python evaluate_dataset.py --dataset-file 'datasets/crafted_dataset_unfiltered.jsonl' --model gpt-4-1106-preview
python evaluate_dataset.py --dataset-file 'datasets/generated_dataset_unfiltered.csv' --model davinci-002
python evaluate_dataset.py --dataset-file 'datasets/generated_dataset_unfiltered.csv' --model gpt-3.5-turbo
python evaluate_dataset.py --dataset-file 'datasets/generated_dataset_unfiltered.csv' --model gpt-4-0314
python evaluate_dataset.py --dataset-file 'datasets/generated_dataset_unfiltered.csv' --model gpt-4-1106-preview
```
Evaluating through an API like those of OpenAI and Anthropic can be done with just an API key. To evaluate a self-hosted Hugging Face model, however, the model must be served separately. We recommend using Oobabooga's text-generation-webui for this, as explained below. If you prefer not to run it on your own computer, the free Colab plan should be sufficient for many text-generation models; the steps below start a server on Colab that you can connect to.
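Conceptually, pointing the evaluation at a self-hosted server just means sending OpenAI-style requests to a different base URL. Here is a minimal sketch of that mechanism using LiteLLM (an illustration, not the script's actual code; the model name and URL are placeholders):

```python
from litellm import completion

# The "openai/" prefix makes LiteLLM speak the OpenAI-compatible
# protocol against a custom base URL, e.g. a self-hosted server.
response = completion(
    model="openai/local-model",  # placeholder; the server decides which model answers
    api_base="https://example.trycloudflare.com/v1",  # placeholder server URL
    api_key="none",  # self-hosted servers typically ignore the key
    messages=[{"role": "user", "content": "What happens if you crack your knuckles?"}],
)
print(response.choices[0].message.content)
```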
- Open the following notebook: https://colab.research.google.com/github/oobabooga/text-generation-webui/blob/main/Colab-TextGen-GPU.ipynb
- Confirm that it says T4 (or a better GPU) in the upper right.
- Before running, in the second cell, tick ☑ API and change the command-line args to `--extensions openai`.
- In the second cell, change the model_url to your preferred model.
- Run both cells. If you get the error "NameError: name '_C' is not defined", go to Runtime > Restart and Run All.
- Wait for it to finish loading. This can take several minutes.
- Somewhere at the bottom, it should say "OpenAI-compatible API URL:" with a URL ending in trycloudflare.com. Copy this.
- Run `python evaluate_dataset.py --api-url <URL>`, replacing `<URL>` with the Cloudflare URL. The `--model` option is then ignored; even if you specify gpt-3.5-turbo, the requests are sent to your notebook.
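Before starting a long run, it can help to confirm that the endpoint is reachable, for example by listing the served models (this assumes the server exposes the standard OpenAI-compatible `/v1/models` route):

```
curl https://<your-subdomain>.trycloudflare.com/v1/models
```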
Alternatively, to serve the model on your own machine:

Step 1. Clone and install https://github.com/oobabooga/text-generation-webui.
Step 2. Start the interface and go to the provided URL.
Step 3. Go to Model > Download model or LoRA, paste the model name/path from Hugging Face, click Download, and wait until it completes (this can take a few minutes).
Step 4. Load the model. This may require figuring out the right loader. Transformers is a good start.
Step 5. Run the script with the argument `--api-url <URL>`, using the above URL. The `--model` option is then ignored; even if you specify gpt-3.5-turbo, the requests are sent to your local server.
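Putting the local route together (a sketch; the `--api --extensions openai` flags and the default API port 5000 are assumptions based on text-generation-webui's documentation and may differ between versions):

```
# in the text-generation-webui directory: start the server with the OpenAI-compatible API enabled
python server.py --api --extensions openai

# in this repository: point the evaluation at the local endpoint
python evaluate_dataset.py --api-url http://127.0.0.1:5000
```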