Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.
You can use Evals to create and run evaluations that:
- use datasets to generate prompts,
- measure the quality of completions provided by an OpenAI model, and
- compare performance across different datasets and models.
With Evals, we aim to make it as simple as possible to build an eval while writing as little code as possible. To get started, we recommend that you follow these steps in order:
- Read through this doc and follow the setup instructions below.
- Learn how to run existing evals: run-evals.md.
- Familiarize yourself with the existing eval templates: eval-templates.md.
- Walk through the process for building an eval: build-eval.md
- See an example of implementing custom eval logic: custom-eval.md.
If you think you have an interesting eval, please open a PR with your contribution. OpenAI staff actively review these evals when considering improvements to upcoming models.
🚨 For a limited time, we will be granting GPT-4 access to those who contribute high quality evals. Please follow the instructions mentioned above and note that spam or low quality submissions will be ignored❗️
Access will be granted to the email address associated with an accepted Eval. Due to high volume, we are unable to grant access to any email other than the one used for the pull request.
To run evals, you will need to set up and specify your OpenAI API key. You can generate one at https://platform.openai.com/account/api-keys. After you obtain an API key, specify it using the OPENAI_API_KEY
environment variable. Please be aware of the costs associated with using the API when running evals.
Our Evals registry is stored using Git-LFS. Once you have downloaded and installed LFS, you can fetch the evals with:
git lfs fetch --all
git lfs pull
You may just want to fetch data for a select eval. You can achieve this via:
git lfs fetch --include=evals/registry/data/${your eval}
git lfs pull
If you are going to be creating evals, we suggest cloning this repo directly from GitHub and installing the requirements using the following command:
pip install -e .
Using -e
, changes you make to your eval will be reflected immediately without having to reinstall.
If you don't want to contribute new evals, but simply want to run them locally, you can install the evals package via pip:
pip install evals
We provide the option for you to log your eval results to a Snowflake database, if you have one or wish to set one up. For this option, you will further have to specify the SNOWFLAKE_ACCOUNT
, SNOWFLAKE_DATABASE
, SNOWFLAKE_USERNAME
, and SNOWFLAKE_PASSWORD
environment variables.
Do you have any examples of how to build an eval from start to finish?
- Yes! These are in the
examples
folder. We recommend that you also read through build-eval.md in order to gain a deeper understanding of what is happening in these examples.
Do you have any examples of evals implemented in multiple different ways?
- Yes! In particular, see
evals/registry/evals/coqa.yaml
. We have implemented small subsets of the CoQA dataset for various eval templates to help illustrate the differences.
I changed my data but this isn't reflected when running my eval, what's going on?
- Your data may have been cached to
/tmp/filecache
. Try removing this cache and rerunning your eval.
When I run an eval, it sometimes hangs at the very end (after the final report). What's going on?
- This is a known issue, but you should be able to interrupt it safely and the eval should finish immediately after.
There's a lot of code, and I just want to spin up a quick eval. Help? OR,
I am a world-class prompt engineer. I choose not to code. How can I contribute my wisdom?
- If you follow an existing eval template to build a basic or model-graded eval, you don't need to write any evaluation code at all! Just provide your data in JSON format and specify your eval parameters in YAML. build-eval.md walks you through these steps, and you can supplement these instructions with the Jupyter notebooks in the
examples
folder to help you get started quickly. Keep in mind, though, that a good eval will inevitably require careful thought and rigorous experimentation!
By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies: https://platform.openai.com/docs/usage-policies.