The Agent Security Bench (ASB) aims to systematically formalize and comprehensively evaluate a broad spectrum of adversarial attacks and defensive strategies tailored to LLM-based agents across 10 diverse scenarios, including but not limited to academic advising, counseling, investment, and legal advice.
The LLM Agent Attacking Framework includes DPI, OPI, Plan-of-Thought (PoT) Backdoor, and Memory Poisoning Attacks, which can compromise the user query, observations, system prompts, and memory retrieval of the agent during action planning and execution.
The development of ASB is based on AIOS.
Git clone ASB
git clone https://github.com/Zhang-Henry/AIOS_agent.git
conda create -n AIOS python=3.11
source activate AIOS
cd AIOS
If you have GPU environments, you can install the dependencies using
pip install -r requirements-cuda.txt
or else you can install the dependencies using
pip install -r requirements.txt
You need to download ollama from from https://ollama.com/.
Then you need to start the ollama server either from ollama app
or using the following command in the terminal
ollama serve
To use models provided by ollama, you need to pull the available models from https://ollama.com/library
ollama pull llama3:8b # use llama3:8b for example
ollama can support CPU-only environment, so if you do not have CUDA environment
You can run aios with ollama models by
python main.py --llm_name ollama/llama3:8b --use_backend ollama # use ollama/llama3:8b for example
However, if you have the GPU environment, you can also pass GPU-related parameters to speed up using the following command
python main.py --llm_name ollama/llama3:8b --use_backend ollama --max_gpu_memory '{"0": "24GB"}' --eval_device "cuda:0" --max_new_tokens 256
For details on how to execute each attack method, please consult the scripts/run.sh
file. The config/
directory contains YAML files that outline the specific argument settings for each configuration.
python scripts/agent_attack.py --cfg_path config/DPI.yml # Direct Prompt Injection
python scripts/agent_attack.py --cfg_path config/OPI.yml # Observation Prompt Injection
python scripts/agent_attack.py --cfg_path config/MP.yml # Memory Poisoning attack
python scripts/agent_attack.py --cfg_path config/mixed.yml # Mixed attack
python scripts/agent_attack_pot.py # PoT backdoor attack
The customizable arguments are stored in YAML files in config/
.
Please list the arguments you want to evaluate in a run. For example, if you want to run GPT-4o and LLaMA3.1-70B at the same time, you should set llms in the following format in YAML files.
llms:
- gpt-4o-2024-08-06
- ollama/llama3.1:70b
Here are the open-source and closed-source LLMs we used in ASB.
LLM | YAML Argument | Source | #Parameters | Provider |
---|---|---|---|---|
Gemma2-9B |
ollama/gemma2:9b |
Open |
9B |
|
Gemma2-27B |
ollama/gemma2:27b |
Open |
27B |
|
LLaMA3-8B |
ollama/llama3:8b |
Open |
8B |
|
LLaMA3-70B |
ollama/llama3:70b |
Open |
70B |
|
LLaMA3.1-8B |
ollama/llama3.1:8b |
Open |
8B |
|
LLaMA3.1-70B |
ollama/llama3.1:70b |
Open |
70B |
|
Mixtral-8x7B |
ollama/mixtral:8x7b |
Open |
56B |
|
Qwen2-7B |
ollama/qwen2:7b |
Open |
7B |
|
Qwen2-72B |
ollama/qwen2:72b |
Open |
72B |
|
Claude-3.5 Sonnet |
claude-3-5-sonnet-20240620 |
Closed |
180B |
|
GPT-3.5 Turbo |
gpt-3.5-turbo |
Closed | 154B |
|
GPT-4o |
gpt-4o-2024-08-06 |
Closed |
8T |
|
GPT-4o-mini |
gpt-4o-mini |
Closed |
8B |
DPI tampers with the user prompt, OPI alters observation data to disrupt subsequent actions, PoT Backdoor Attack triggers concealed actions upon encountering specific inputs, and Memory Poisoning Attack injects malicious plans into the agent’s memory, thereby compelling the agent to employ attacker-specified tools.
Here are the yaml arguments representation for the agent attacks and corresponding defenses. The attack method in parentheses in the last column indicates the corresponding defense's target attack method.
Attacks | YAML Argument | Defenses | YAML Argument |
---|---|---|---|
DPI |
direct_prompt_injection |
Delimiters |
delimiters_defense (DPI, OPI) |
OPI |
observation_prompt_injection |
Sandwich Prevention |
ob_sandwich_defense (OPI) |
Memory Poisoning |
memory_attack |
Instructional Prevention |
instructional_prevention (DPI, OPI) |
PoT Backdoor |
pot_backdoor |
Paraphrasing |
direct_paraphrase_defense (DPI) pot_paraphrase_defense (PoT) |
PoT Clean |
pot_clean |
Shuffle |
pot_shuffling_defense (PoT) |
====================Attacks & Defenses====================
attack_tool: Tools to attack the target agent.
- agg: run with aggressive attack tools
- non-agg: run with non-aggressive attack tools
- all: run with both tools.
llms: The LLMs to use in the evaluation. Please add "ollama/" for open-source LLMs.
attack_types: The attack types of prompt injection.
defense_type: The defense types of corresponding attacks.
Please note that a defense type only corresponds to some attacks types, not all.
==================Database Read & Write==================
read_db: whether to read the database.
write_db: whether to write the database.
===================PoT Backdoor Triggers=================
triggers: PoT triggers to use.
========================Log Saving=======================
suffix: To distinguish between different runs, append a unique identifier to the end of the log file name (in logs/).
We evaluated the agent attacks with 5 attack types on 13 LLM backbones, here shows the average attack results of the LLM agents with different LLM backbones.
LLM | DPI ASR | DPI RR | OPI ASR | OPI RR | Memory Poisoning ASR | Memory Poisoning RR | Mixed Attack ASR | Mixed Attack RR | PoT Backdoor ASR | PoT Backdoor RR | Average ASR | Average RR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Gemma2-9B | 87.10% | 4.30% | 14.20% | 15.00% | 6.85% | 9.85% | 92.17% | 1.33% | 39.75% | 5.25% | 48.01% | 7.15% |
Gemma2-27B | 96.75% | 0.90% | 14.20% | 3.90% | 6.25% | 5.45% | 100.00% | 0.50% | 54.50% | 3.50% | 54.34% | 2.85% |
LLaMA3-8B | 25.20% | 7.45% | 10.55% | 3.00% | 3.30% | 5.45% | 40.75% | 5.75% | 21.50% | 2.50% | 20.26% | 4.83% |
LLaMA3-70B | 86.15% | 7.80% | 43.70% | 3.00% | 1.85% | 1.80% | 85.50% | 6.50% | 57.00% | 2.00% | 54.84% | 4.22% |
LLaMA3.1-8B | 51.10% | 5.20% | 6.40% | 1.85% | 25.65% | 6.75% | 73.50% | 3.50% | 19.00% | 5.75% | 35.13% | 4.61% |
LLaMA3.1-70B | 85.65% | 5.30% | 12.10% | 4.95% | 2.85% | 2.20% | 94.50% | 1.25% | 59.75% | 6.25% | 50.97% | 3.99% |
Mixtral-8x7B | 25.85% | 9.55% | 4.80% | 8.55% | 4.90% | 1.35% | 54.75% | 6.75% | 4.75% | 13.25% | 19.01% | 7.89% |
Qwen2-7B | 55.20% | 7.70% | 9.00% | 6.00% | 2.85% | 4.95% | 76.00% | 2.50% | 12.25% | 4.50% | 31.06% | 5.13% |
Qwen2-72B | 86.95% | 4.20% | 21.35% | 16.55% | 3.95% | 5.45% | 98.50% | 0.75% | 57.75% | 4.75% | 53.70% | 6.34% |
Claude3.5 Sonnet | 90.75% | 7.65% | 59.70% | 25.50% | 19.75% | 1.20% | 94.50% | 6.25% | 17.50% | 11.75% | 56.44% | 10.47% |
GPT-3.5 Turbo | 98.40% | 3.00% | 55.10% | 16.85% | 9.30% | 0.30% | 99.75% | 0.00% | 8.25% | 10.75% | 54.16% | 6.18% |
GPT-4o | 60.35% | 20.05% | 62.45% | 6.50% | 10.00% | 11.75% | 89.25% | 5.50% | 100.00% | 0.25% | 64.41% | 8.81% |
GPT-4o-mini | 95.45% | 1.85% | 44.55% | 0.25% | 5.50% | 3.65% | 96.75% | 1.25% | 95.50% | 0.00% | 67.55% | 1.40% |
Average | 72.68% | 6.53% | 27.55% | 8.61% | 7.92% | 4.63% | 84.30% | 3.22% | 42.12% | 5.42% | 46.91% | 5.68% |
LLM | DPI ASR | Delimiter ASR-d | Paraphrase ASR-d | Instruction ASR-d |
---|---|---|---|---|
Gemma2-9B | 91.00% | 91.75% | 62.50% | 91.00% |
Gemma2-27B | 98.75% | 99.75% | 68.00% | 99.50% |
LLaMA3-8B | 33.75% | 62.75% | 28.50% | 52.00% |
LLaMA3-70B | 87.75% | 88.25% | 71.25% | 87.25% |
LLaMA3.1-8B | 64.25% | 65.00% | 42.50% | 68.75% |
LLaMA3.1-70B | 93.50% | 92.75% | 56.75% | 90.50% |
Mixtral-8x7B | 43.25% | 43.00% | 21.00% | 34.00% |
Qwen2-7B | 73.50% | 80.00% | 46.25% | 76.75% |
Qwen2-72B | 94.50% | 95.00% | 60.50% | 95.50% |
Claude-3.5 Sonnet | 87.75% | 79.00% | 65.25% | 70.25% |
GPT-3.5 Turbo | 99.75% | 99.75% | 78.25% | 99.50% |
GPT-4o | 55.50% | 52.25% | 62.50% | 70.75% |
GPT-4o-mini | 95.75% | 78.75% | 76.00% | 62.25% |
Average | 78.38% | 79.08% | 56.87% | 76.77% |
0 | 0.69% | -21.52% | -1.62% |
LLM | OPI ASR | Delimiter ASR-d | Instruction ASR-d | Sandwich ASR-d |
---|---|---|---|---|
Gemma2-9B | 14.50% | 10.00% | 13.50% | 10.25% |
Gemma2-27B | 15.50% | 13.75% | 16.00% | 14.00% |
LLaMA3-8B | 11.50% | 9.25% | 8.75% | 13.00% |
LLaMA3-70B | 45.50% | 34.50% | 41.50% | 39.75% |
LLaMA3.1-8B | 5.50% | 9.00% | 9.50% | 9.50% |
LLaMA3.1-70B | 14.00% | 11.00% | 10.75% | 12.75% |
Mixtral-8x7B | 5.75% | 8.50% | 7.75% | 10.25% |
Qwen2-7B | 9.25% | 11.25% | 9.50% | 11.00% |
Qwen2-72B | 23.75% | 17.50% | 26.50% | 21.75% |
Claude-3.5 Sonnet | 56.00% | 59.75% | 56.25% | 56.50% |
GPT-3.5 Turbo | 59.00% | 23.75% | 44.25% | 58.50% |
GPT-4o | 62.00% | 66.75% | 61.75% | 64.75% |
GPT-4o-mini | 41.50% | 49.50% | 36.00% | 42.50% |
Average | 27.98% | 24.96% | 26.31% | 28.04% |
0 | -3.02% | -1.67% | 0.06% |
PPL Detection Defense
The following figure is FPR vs. FNR curve for PPL detection in identifying memory poisoning attacks illustrates variations in False Negative Rate (FNR) and False Positive Rate (FPR) across different thresholds.
High perplexity indicates compromised content. Shallower colors correspond to lower thresholds, while darker colors correspond to higher thresholds.
LLM-based Defense
The following table is the LLM-based Defense result for memory poisoning attack. The defense mechanisms against memory poisoning attacks have proven largely ineffective.
LLM | FNR | FPR |
---|---|---|
Gemma2-9B | 0.658 | 0.204 |
Gemma2-27B | 0.655 | 0.201 |
LLaMA3-8B | 0.654 | 0.204 |
LLaMA3-70B | 0.661 | 0.202 |
LLaMA3.1-8B | 0.656 | 0.200 |
LLaMA3.1-70B | 0.659 | 0.197 |
Mixtral-8x7B | 0.665 | 0.203 |
Qwen2-7B | 0.657 | 0.193 |
Qwen2-72B | 0.671 | 0.198 |
Claude-3.5 Sonnet | 0.663 | 0.199 |
GPT-3.5 Turbo | 0.661 | 0.198 |
GPT-4o | 0.664 | 0.203 |
GPT-4o-mini | 0.657 | 0.200 |
Average | 0.660 | 0.200 |
LLM | PoT attack ASR | No attack PNA | Shuffle ASR-d | Shuffle PNA-d | Paraphrase ASR-d | Paraphrase PNA-d |
---|---|---|---|---|---|---|
Gemma2-9B | 39.75% | 10.75% | 67.25% | 22.25% | 24.50% | 21.75% |
Gemma2-27B | 54.50% | 31.50% | 59.50% | 40.75% | 23.25% | 32.25% |
LLaMA3-8B | 21.50% | 1.50% | 2.25% | 3.50% | 5.00% | 6.00% |
LLaMA3-70B | 57.00% | 66.50% | 63.75% | 54.50% | 44.75% | 52.75% |
LLaMA3.1-8B | 19.00% | 0.75% | 17.25% | 2.75% | 17.50% | 2.50% |
LLaMA3.1-70B | 59.75% | 21.25% | 69.00% | 43.00% | 42.00% | 30.00% |
Mixtral-8x7B | 4.75% | 0.00% | 12.25% | 0.25% | 4.50% | 0.50% |
Qwen2-7B | 12.25% | 9.75% | 14.50% | 13.00% | 11.00% | 10.25% |
Qwen2-72B | 57.75% | 4.00% | 22.75% | 10.75% | 37.75% | 18.00% |
Claude-3.5 Sonnet | 17.50% | 100.00% | 93.50% | 81.50% | 13.75% | 82.75% |
GPT-3.5 Turbo | 8.25% | 8.00% | 16.50% | 16.75% | 6.25% | 23.50% |
GPT-4o | 100.00% | 79.00% | 98.50% | 78.50% | 84.75% | 88.00% |
GPT-4o-mini | 95.50% | 50.00% | 39.75% | 63.75% | 62.75% | 79.00% |
Average | 42.12% | 29.46% | 44.37% | 33.17% | 29.06% | 34.40% |
We visualize the correlation between backbone LLM leaderboard quality and average ASR across various attacks in the following figure.