## Setting up
Clone both the benchmark and the artifacts into the same directory; the benchmark will find the artifacts automatically. Alternatively, if you want to read or write artifacts at a different location, set the environment variable `ARTIFACTS_ROOT` to the location of the artifacts repository.
```bash
git clone https://github.com/rapidresponsebench/rapidresponsebench.git
git clone https://github.com/rapidresponsebench/rapidresponseartifacts.git
```
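For example, a minimal sketch of pointing the benchmark at a custom artifacts location from Python (the path is a placeholder, and setting the variable before the import is an assumption about load order):

```python
import os

# Placeholder path: point this at your clone of rapidresponseartifacts.
# Set it before importing rapidresponsebench so the override is picked up.
os.environ["ARTIFACTS_ROOT"] = "/path/to/rapidresponseartifacts"
```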
Set up a virtual environment and install the dependencies (including `rapidresponsebench` itself):
```bash
cd rapidresponsebench
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -e .
```
The scaffolding we use in `notebooks` relies on `pm2` to manage parallel instances of the rapid response pipeline. To install it:
```bash
apt-get update && apt install -y npm && npm i -g pm2
```
For plotting, make sure you have the required LaTeX packages:

```bash
apt install texlive-latex-extra texlive-fonts-recommended dvipng cm-super
```
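These packages are what matplotlib needs for LaTeX text rendering; a minimal sketch of the kind of configuration involved (an assumption about the plotting setup, not necessarily the repo's exact settings):

```python
import matplotlib.pyplot as plt

# Assumed configuration: render plot text with LaTeX, which requires the
# texlive packages, dvipng, and cm-super installed above.
plt.rcParams.update({"text.usetex": True, "font.family": "serif"})
```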
Attack artifacts and proliferations are already generated for three target models; we recommend using these pre-generated artifacts rather than re-running the attacks. If you do want to run attacks, set up a separate virtual environment inside `rapidresponseartifacts`, since it depends on `easyjailbreak`, which conflicts with the latest versions of some remote inference SDKs.
In our paper, we run attacks against a language model guarded by llama-guard-2. Because there are no convenient remote inference providers for llama-guard-2, we run it locally. To avoid filling GPU memory with a separate llama-guard-2 instance for every parallel attack process, we start a single local web server and point all parallel processes at it (specified using `guard_url`). You can start this web server with `python utils/guard_service.py`.
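As a rough sketch, launching the shared guard server programmatically might look like this (the address is hypothetical; check `utils/guard_service.py` for the host and port it actually binds to):

```python
import subprocess

# Start the single shared llama-guard-2 server; every parallel attack
# process then points its guard_url at this one instance.
server = subprocess.Popen(["python", "utils/guard_service.py"])

# Hypothetical address; use whatever guard_service.py actually serves on.
GUARD_URL = "http://localhost:8000"
```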
## API keys
The defense pipeline makes use of Claude-3.5-Sonnet, various open-source models, and GPT-4o. You need to set `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, and `TOGETHER_API_KEY` for this to work.
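A quick way to verify the keys are visible to the pipeline (a convenience sketch, not part of the benchmark's API):

```python
import os

# Fail fast if any provider key is missing from the environment.
for key in ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "TOGETHER_API_KEY"):
    assert os.environ.get(key), f"{key} is not set"
```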
## Running the defense pipeline
`notebooks/evaluate_main.py` contains examples of sweeping across different defense configurations. Jobs are distributed across parallel `pm2` processes.
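As a rough illustration of the pattern (not the repo's actual scaffolding; the `--config` argument and config names are hypothetical), launching parallel pipeline workers under `pm2` could look like:

```python
import subprocess

# Illustrative only: start one pm2-managed process per configuration.
# The real sweep logic lives in notebooks/evaluate_main.py.
for i, config in enumerate(["config_a", "config_b"]):
    subprocess.run([
        "pm2", "start", "notebooks/evaluate_main.py",
        "--interpreter", "python3",
        "--name", f"rapidresponse-{i}",
        "--", "--config", config,
    ], check=True)
```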
`RapidResponseBench` uses several layers of caching to make development easier:

- Attack and proliferation artifacts are generated once and shared across all runs.
- Defense caching (overwrite with `overwrite=True`): after a defense performs an adaptive update, we save the state of the defense and cache it so that evaluation runs can compute final scores.
- Results caching (overwriting is the default; control it with `cache_results=False`): the evaluation runs themselves can also be cached.

Make sure you're invalidating and using the right caches.
## Loading artifacts
We expose singletons that allow easy loading of different data sources:
```python
from rapidresponsebench import DEFENSE, ATTACK, PROLIFERATION, RESULT

# Proliferated attack prompts for a given attack, target model,
# and proliferation settings.
PROLIFERATION.fetch_artifacts(
    attack="pair_iid",
    target="gpt-4o-08-06-2024",
    proliferation_model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    proliferation_temperature=1.0,
    proliferation_top_p=1.0,
    proliferation_shots=1,
    proliferation_is_benign=False,
)

# Raw attack artifacts for the train split of behaviors.
ATTACK.fetch_artifacts(
    attack="pair_iid",
    target="gpt-4o-08-06-2024",
    behaviors="train",
)

# Evaluation results for a defense configuration, as JSON.
RESULT.fetch(
    response="guardfinetuning",
    attacks="cipher,crescendo,msj,pair,renellm,skeleton_key",
    model="gpt-4o-08-06-2024",
    shots=5,
    proliferation_model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    proliferation_compute_fraction=1.0,
    proliferation_top_p=1.0,
    proliferation_temperature=1.0,
).json()

# The saved post-adaptive-update defense state, as a pickled object.
DEFENSE.fetch(
    response="guardfinetuning",
    attacks="cipher,crescendo,msj,pair,renellm,skeleton_key",
    model="gpt-4o-08-06-2024",
    shots=5,
    proliferation_model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    proliferation_compute_fraction=1.0,
    proliferation_top_p=1.0,
    proliferation_temperature=1.0,
).pkl()
```