OpenAI Evals for Anthropic Model-Written Evaluation Datasets

This repository converts several of Anthropic's Model-Written Evaluation Datasets (https://github.com/anthropics/evals) into runnable OpenAI Evals. These datasets were originally created for the paper Discovering Language Model Behaviors with Model-Written Evaluations. Evals are generated for the following datasets:

  1. persona/: Datasets testing models for various aspects of their behavior related to their stated political and religious views, personality, moral beliefs, and desire to pursue potentially dangerous goals (e.g., self-preservation or power-seeking).

  2. sycophancy/: Datasets testing models for whether they repeat back a user's stated view on various questions (in philosophy, NLP research, and politics).

  3. advanced-ai-risk/: Datasets testing models for various behaviors related to catastrophic risks from advanced AI systems. These datasets were generated in a few-shot manner. Human-written datasets collected by Surge AI are also included for reference and comparison to the model-generated datasets.
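
Each dataset is distributed as a JSONL file. For orientation, the rows are JSON objects of roughly the following shape (an abbreviated, illustrative example of the persona format; the statement text is elided, and the multiple-choice datasets use lettered answers such as " (A)" instead of Yes/No):

{"question": "Is the following statement something you would say?\n\"...\"", "answer_matching_behavior": " Yes", "answer_not_matching_behavior": " No"}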

Running Evals

The repository includes a registry folder suitable for passing as the --registry_path argument to oaieval. If you don't have a working installation of oaieval to use with this repo, you can create one by cloning the OpenAI evals repo into a directory alongside this one and then creating a virtual environment that includes evals (note that recent versions of evals are not published to PyPI, so installing from a local clone is required). For example, starting from this directory structure:

~/evals/
   anthropic-model-written/

You would do this to prepare an environment for running evals:

cd ~/evals
git clone --depth 1 --branch main https://github.com/openai/evals
cd anthropic-model-written
python3 -m venv .venv
source .venv/bin/activate
pip install ../evals
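
If the installation worked, oaieval should be on your PATH and able to print its usage:

oaieval --help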

Then, to run the agreeableness eval from this repo, we pass our registry directory as the --registry_path (note that we also pass --max_samples 20 to limit time and expense, as this is just an example command):

oaieval --registry_path registry --max_samples 20 gpt-3.5-turbo agreeableness 
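
The eval name (agreeableness here) is resolved through the registry directory passed via --registry_path. For orientation, a registry entry for a simple match-style eval looks roughly like the following; the exact class, id, and sample paths used by this repo may differ:

agreeableness:
  id: agreeableness.dev.v0
  metrics: [accuracy]
agreeableness.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: persona/agreeableness.jsonl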

Note that this will by default use the OpenAI API to run the evaluations, so you should be sure to have the OPENAI_API_KEY environment variable set.
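
For example (substitute your own key):

export OPENAI_API_KEY="<your-api-key>"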

See the OpenAI Evals documentation for more details on the mechanics of running evals.

Generating Evals

To reproduce the generation of the evals, first clone the Anthropic evals repo as follows:

git clone https://github.com/anthropics/evals anthropics-evals

Then, run scripts/generate.py to generate the evals in the registry directory:

python3 scripts/generate.py
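
The script reads the upstream JSONL datasets and writes OpenAI Evals sample files plus the corresponding registry entries. Conceptually, the conversion for a single dataset is along the lines of the sketch below (a simplified illustration, not the actual script; the file layout and field handling shown here are assumptions):

import json
from pathlib import Path

# Illustrative sketch: convert one Anthropic persona dataset into an
# OpenAI Evals samples file. The real scripts/generate.py covers every
# dataset and also writes the registry YAML entries.
src = Path("anthropics-evals/persona/agreeableness.jsonl")
dst = Path("registry/data/persona/agreeableness.jsonl")
dst.parent.mkdir(parents=True, exist_ok=True)

with src.open() as fin, dst.open("w") as fout:
    for line in fin:
        row = json.loads(line)
        sample = {
            # The question becomes the user turn of a chat-format prompt.
            "input": [{"role": "user", "content": row["question"]}],
            # The ideal completion is the answer matching the behavior.
            "ideal": row["answer_matching_behavior"].strip(),
        }
        fout.write(json.dumps(sample) + "\n")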