RapidResponseBench

Setting up

Clone both the benchmark and the artifacts into the same directory; the benchmark will automatically find the artifacts. Alternatively, if you want to read and write artifacts from a different location, set the environment variable ARTIFACTS_ROOT to the path of the artifacts repository.

git clone https://github.com/rapidresponsebench/rapidresponsebench.git
git clone https://github.com/rapidresponsebench/rapidresponseartifacts.git
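
If you keep the artifacts elsewhere, a minimal sketch for pointing the benchmark at them from Python (the path below is hypothetical; exporting ARTIFACTS_ROOT in your shell works just as well):

import os

# Hypothetical artifacts location; set this before the benchmark resolves artifact paths
os.environ["ARTIFACTS_ROOT"] = "/data/rapidresponseartifacts"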

Setting up a virtual environment and installing dependencies (including rapidresponsebench):

cd rapidresponsebench
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -e .

The scaffolding we use in the notebooks relies on pm2 to manage parallel instances of the rapid response pipeline. To install it:

apt-get update && apt-get install -y npm && npm install -g pm2

For plotting, make sure you have the required LaTeX packages:

apt install texlive-latex-extra texlive-fonts-recommended dvipng cm-super

Attack artifacts and proliferations are already generated for three target models; we recommend using these pre-generated artifacts rather than re-running the attacks. If you do want to run attacks, set up a separate virtual environment inside rapidresponseartifacts, since it depends on easyjailbreak, which conflicts with the latest versions of some remote inference SDKs.

In our paper, we run attacks against a language model guarded by llama-guard-2. Because there is no convenient remote inference provider for llama-guard-2, we run it locally. To avoid filling GPU memory with one llama-guard-2 instance per parallel attack process, we start a single local web server and point all parallel processes at it (specified using guard_url). You can start this web server with python utils/guard_service.py.
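
As an optional sanity check before launching parallel attacks, you can verify the guard server is reachable; the host and port below are assumptions, so match them to whatever guard_url you actually configured:

import urllib.error
import urllib.request

# Assumed guard_url; adjust to match how utils/guard_service.py was launched
GUARD_URL = "http://localhost:8000"

try:
    urllib.request.urlopen(GUARD_URL, timeout=5)
    print(f"guard server reachable at {GUARD_URL}")
except urllib.error.HTTPError:
    # The server answered (just not with 200 for "/"), so it is reachable
    print(f"guard server reachable at {GUARD_URL}")
except urllib.error.URLError as exc:
    print(f"guard server NOT reachable at {GUARD_URL}: {exc}")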

API keys

The defense pipeline makes use of Claude-3.5-Sonnet, GPT-4o, and various open-source models. You need to set the ANTHROPIC_API_KEY, OPENAI_API_KEY, and TOGETHER_API_KEY environment variables for this to work.
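
An optional fail-fast check (not part of the benchmark itself) that the required keys are present:

import os

# Keys the defense pipeline expects to find in the environment
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "TOGETHER_API_KEY")

missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
if missing:
    raise RuntimeError(f"Missing API keys: {', '.join(missing)}")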

Running the defense pipeline

notebooks/evaluate_main.py contains examples of sweeping across different defense configurations. Jobs are distributed across parallel pm2 processes.

RapidResponseBench uses several layers of caching to make development easier.

  • Attack and proliferation artifacts are generated once and shared across all runs.
  • Defense caching (overwrite with overwrite=True): after a defense performs an adaptive update, we save its state and cache it so evaluation runs can compute final scores without repeating the update.
  • Results caching: evaluation runs themselves can also be cached (the default is to overwrite; controlled with cache_results=False).

Make sure you're invalidating / using the right caches.

Loading artifacts

We expose singletons that allow easy loading of different data sources:

from rapidresponsebench import DEFENSE, ATTACK, PROLIFERATION, RESULT

# Proliferation artifacts for the pair_iid attack against GPT-4o
PROLIFERATION.fetch_artifacts(
    attack="pair_iid",
    target="gpt-4o-08-06-2024",
    proliferation_model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    proliferation_temperature=1.0,
    proliferation_top_p=1.0,
    proliferation_shots=1,
    proliferation_is_benign=False
)

# Attack artifacts for the pair_iid attack on the train behavior split
ATTACK.fetch_artifacts(
    attack="pair_iid",
    target="gpt-4o-08-06-2024",
    behaviors="train"
)

# Evaluation results (JSON) for the guardfinetuning defense
RESULT.fetch(
    response="guardfinetuning",
    attacks="cipher,crescendo,msj,pair,renellm,skeleton_key",
    model="gpt-4o-08-06-2024",
    shots=5,
    proliferation_model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    proliferation_compute_fraction=1.0,
    proliferation_top_p=1.0,
    proliferation_temperature=1.0
).json()


# Saved defense state (pickle) for the same configuration
DEFENSE.fetch(
    response="guardfinetuning",
    attacks="cipher,crescendo,msj,pair,renellm,skeleton_key",
    model="gpt-4o-08-06-2024",
    shots=5,
    proliferation_model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    proliferation_compute_fraction=1.0,
    proliferation_top_p=1.0,
    proliferation_temperature=1.0
).pkl()
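
As a usage sketch, the same RESULT query can be swept over different shot counts to collect JSON results for plotting; the shot values below are illustrative, so substitute whichever configurations you actually evaluated:

results = {}
for shots in [1, 5, 25]:  # illustrative shot counts
    results[shots] = RESULT.fetch(
        response="guardfinetuning",
        attacks="cipher,crescendo,msj,pair,renellm,skeleton_key",
        model="gpt-4o-08-06-2024",
        shots=shots,
        proliferation_model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
        proliferation_compute_fraction=1.0,
        proliferation_top_p=1.0,
        proliferation_temperature=1.0
    ).json()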