Agent Security Bench (ASB)

The Agent Security Bench (ASB) aims to systematically formalize and comprehensively evaluate a broad spectrum of adversarial attacks and defensive strategies tailored to LLM-based agents across 10 diverse scenarios, including but not limited to academic advising, counseling, investment, and legal advice.

⚔️ LLM Agent Attacking Framework

The LLM Agent Attacking Framework includes DPI, OPI, Plan-of-Thought (PoT) Backdoor, and Memory Poisoning Attacks, which can compromise the user query, observations, system prompts, and memory retrieval of the agent during action planning and execution.

✈️ Getting Started

The development of ASB is based on AIOS.

Environment Installation

Git clone ASB

git clone https://github.com/Zhang-Henry/AIOS_agent.git

conda create -n AIOS python=3.11
source activate AIOS
cd AIOS

If you have GPU environments, you can install the dependencies using

pip install -r requirements-cuda.txt

or else you can install the dependencies using

pip install -r requirements.txt

Use with ollama

You need to download ollama from from https://ollama.com/.

Then you need to start the ollama server either from ollama app

or using the following command in the terminal

ollama serve

To use models provided by ollama, you need to pull the available models from https://ollama.com/library

ollama pull llama3:8b # use llama3:8b for example

ollama can support CPU-only environment, so if you do not have CUDA environment

You can run aios with ollama models by

python main.py --llm_name ollama/llama3:8b --use_backend ollama # use ollama/llama3:8b for example

However, if you have the GPU environment, you can also pass GPU-related parameters to speed up using the following command

python main.py --llm_name ollama/llama3:8b --use_backend ollama --max_gpu_memory '{"0": "24GB"}' --eval_device "cuda:0" --max_new_tokens 256

Quickstart

For details on how to execute each attack method, please consult the scripts/run.sh file. The config/ directory contains YAML files that outline the specific argument settings for each configuration.

python scripts/agent_attack.py --cfg_path config/DPI.yml # Direct Prompt Injection
python scripts/agent_attack.py --cfg_path config/OPI.yml # Observation Prompt Injection
python scripts/agent_attack.py --cfg_path config/MP.yml # Memory Poisoning attack
python scripts/agent_attack.py --cfg_path config/mixed.yml # Mixed attack
python scripts/agent_attack_pot.py # PoT backdoor attack

Customizable Arguments

The customizable arguments are stored in YAML files in config/. Please list the arguments you want to evaluate in a run. For example, if you want to run GPT-4o and LLaMA3.1-70B at the same time, you should set llms in the following format in YAML files.

llms:
  - gpt-4o-2024-08-06
  - ollama/llama3.1:70b

Available LLMs in ASB

Here are the open-source and closed-source LLMs we used in ASB.

LLM	YAML Argument	Source	#Parameters
Gemma2-9B	ollama/gemma2:9b	Open	9B
Gemma2-27B	ollama/gemma2:27b	Open	27B
LLaMA3-8B	ollama/llama3:8b	Open	8B
LLaMA3-70B	ollama/llama3:70b	Open	70B
LLaMA3.1-8B	ollama/llama3.1:8b	Open	8B
LLaMA3.1-70B	ollama/llama3.1:70b	Open	70B
Mixtral-8x7B	ollama/mixtral:8x7b	Open	56B
Qwen2-7B	ollama/qwen2:7b	Open	7B
Qwen2-72B	ollama/qwen2:72b	Open	72B
Claude-3.5 Sonnet	claude-3-5-sonnet-20240620	Closed	180B
GPT-3.5 Turbo	gpt-3.5-turbo	Closed	154B
GPT-4o	gpt-4o-2024-08-06	Closed	8T
GPT-4o-mini	gpt-4o-mini	Closed	8B

Agent Attacks

DPI tampers with the user prompt, OPI alters observation data to disrupt subsequent actions, PoT Backdoor Attack triggers concealed actions upon encountering specific inputs, and Memory Poisoning Attack injects malicious plans into the agent’s memory, thereby compelling the agent to employ attacker-specified tools.

Here are the yaml arguments representation for the agent attacks and corresponding defenses. The attack method in parentheses in the last column indicates the corresponding defense's target attack method.

Attacks	YAML Argument	Defenses	YAML Argument
DPI	direct_prompt_injection	Delimiters	delimiters_defense (DPI, OPI)
OPI	observation_prompt_injection	Sandwich Prevention	ob_sandwich_defense (OPI)
Memory Poisoning	memory_attack	Instructional Prevention	instructional_prevention (DPI, OPI)
PoT Backdoor	pot_backdoor	Paraphrasing	direct_paraphrase_defense (DPI) pot_paraphrase_defense (PoT)
PoT Clean	pot_clean	Shuffle	pot_shuffling_defense (PoT)

Other Customizable Agruments

====================Attacks & Defenses====================
attack_tool: Tools to attack the target agent.
  - agg: run with aggressive attack tools
  - non-agg: run with non-aggressive attack tools
  - all: run with both tools.

llms: The LLMs to use in the evaluation. Please add "ollama/" for open-source LLMs.

attack_types: The attack types of prompt injection.

defense_type: The defense types of corresponding attacks. 
Please note that a defense type only corresponds to some attacks types, not all.

==================Database Read & Write==================
read_db: whether to read the database.

write_db: whether to write the database.

===================PoT Backdoor Triggers=================
triggers: PoT triggers to use.

========================Log Saving=======================
suffix: To distinguish between different runs, append a unique identifier to the end of the log file name (in logs/).

📊 Experimental Result

Agent Attack

We evaluated the agent attacks with 5 attack types on 13 LLM backbones, here shows the average attack results of the LLM agents with different LLM backbones.

LLM	DPI ASR	DPI RR	OPI ASR	OPI RR	Memory Poisoning ASR	Memory Poisoning RR	Mixed Attack ASR	Mixed Attack RR	PoT Backdoor ASR	PoT Backdoor RR	Average ASR	Average RR
Gemma2-9B	87.10%	4.30%	14.20%	15.00%	6.85%	9.85%	92.17%	1.33%	39.75%	5.25%	48.01%	7.15%
Gemma2-27B	96.75%	0.90%	14.20%	3.90%	6.25%	5.45%	100.00%	0.50%	54.50%	3.50%	54.34%	2.85%
LLaMA3-8B	25.20%	7.45%	10.55%	3.00%	3.30%	5.45%	40.75%	5.75%	21.50%	2.50%	20.26%	4.83%
LLaMA3-70B	86.15%	7.80%	43.70%	3.00%	1.85%	1.80%	85.50%	6.50%	57.00%	2.00%	54.84%	4.22%
LLaMA3.1-8B	51.10%	5.20%	6.40%	1.85%	25.65%	6.75%	73.50%	3.50%	19.00%	5.75%	35.13%	4.61%
LLaMA3.1-70B	85.65%	5.30%	12.10%	4.95%	2.85%	2.20%	94.50%	1.25%	59.75%	6.25%	50.97%	3.99%
Mixtral-8x7B	25.85%	9.55%	4.80%	8.55%	4.90%	1.35%	54.75%	6.75%	4.75%	13.25%	19.01%	7.89%
Qwen2-7B	55.20%	7.70%	9.00%	6.00%	2.85%	4.95%	76.00%	2.50%	12.25%	4.50%	31.06%	5.13%
Qwen2-72B	86.95%	4.20%	21.35%	16.55%	3.95%	5.45%	98.50%	0.75%	57.75%	4.75%	53.70%	6.34%
Claude3.5 Sonnet	90.75%	7.65%	59.70%	25.50%	19.75%	1.20%	94.50%	6.25%	17.50%	11.75%	56.44%	10.47%
GPT-3.5 Turbo	98.40%	3.00%	55.10%	16.85%	9.30%	0.30%	99.75%	0.00%	8.25%	10.75%	54.16%	6.18%
GPT-4o	60.35%	20.05%	62.45%	6.50%	10.00%	11.75%	89.25%	5.50%	100.00%	0.25%	64.41%	8.81%
GPT-4o-mini	95.45%	1.85%	44.55%	0.25%	5.50%	3.65%	96.75%	1.25%	95.50%	0.00%	67.55%	1.40%
Average	72.68%	6.53%	27.55%	8.61%	7.92%	4.63%	84.30%	3.22%	42.12%	5.42%	46.91%	5.68%

Agent Defense

Defenses Against DPI

LLM	DPI ASR	Delimiter ASR-d	Paraphrase ASR-d	Instruction ASR-d
Gemma2-9B	91.00%	91.75%	62.50%	91.00%
Gemma2-27B	98.75%	99.75%	68.00%	99.50%
LLaMA3-8B	33.75%	62.75%	28.50%	52.00%
LLaMA3-70B	87.75%	88.25%	71.25%	87.25%
LLaMA3.1-8B	64.25%	65.00%	42.50%	68.75%
LLaMA3.1-70B	93.50%	92.75%	56.75%	90.50%
Mixtral-8x7B	43.25%	43.00%	21.00%	34.00%
Qwen2-7B	73.50%	80.00%	46.25%	76.75%
Qwen2-72B	94.50%	95.00%	60.50%	95.50%
Claude-3.5 Sonnet	87.75%	79.00%	65.25%	70.25%
GPT-3.5 Turbo	99.75%	99.75%	78.25%	99.50%
GPT-4o	55.50%	52.25%	62.50%	70.75%
GPT-4o-mini	95.75%	78.75%	76.00%	62.25%
Average	78.38%	79.08%	56.87%	76.77%
$\Delta$	0	0.69%	-21.52%	-1.62%

Defenses Against OPI

LLM	OPI ASR	Delimiter ASR-d	Instruction ASR-d	Sandwich ASR-d
Gemma2-9B	14.50%	10.00%	13.50%	10.25%
Gemma2-27B	15.50%	13.75%	16.00%	14.00%
LLaMA3-8B	11.50%	9.25%	8.75%	13.00%
LLaMA3-70B	45.50%	34.50%	41.50%	39.75%
LLaMA3.1-8B	5.50%	9.00%	9.50%	9.50%
LLaMA3.1-70B	14.00%	11.00%	10.75%	12.75%
Mixtral-8x7B	5.75%	8.50%	7.75%	10.25%
Qwen2-7B	9.25%	11.25%	9.50%	11.00%
Qwen2-72B	23.75%	17.50%	26.50%	21.75%
Claude-3.5 Sonnet	56.00%	59.75%	56.25%	56.50%
GPT-3.5 Turbo	59.00%	23.75%	44.25%	58.50%
GPT-4o	62.00%	66.75%	61.75%	64.75%
GPT-4o-mini	41.50%	49.50%	36.00%	42.50%
Average	27.98%	24.96%	26.31%	28.04%
$\Delta$	0	-3.02%	-1.67%	0.06%

Defenses Against Memory Poisoning

PPL Detection Defense

The following figure is FPR vs. FNR curve for PPL detection in identifying memory poisoning attacks illustrates variations in False Negative Rate (FNR) and False Positive Rate (FPR) across different thresholds.

High perplexity indicates compromised content. Shallower colors correspond to lower thresholds, while darker colors correspond to higher thresholds.

LLM-based Defense

The following table is the LLM-based Defense result for memory poisoning attack. The defense mechanisms against memory poisoning attacks have proven largely ineffective.

LLM	FNR	FPR
Gemma2-9B	0.658	0.204
Gemma2-27B	0.655	0.201
LLaMA3-8B	0.654	0.204
LLaMA3-70B	0.661	0.202
LLaMA3.1-8B	0.656	0.200
LLaMA3.1-70B	0.659	0.197
Mixtral-8x7B	0.665	0.203
Qwen2-7B	0.657	0.193
Qwen2-72B	0.671	0.198
Claude-3.5 Sonnet	0.663	0.199
GPT-3.5 Turbo	0.661	0.198
GPT-4o	0.664	0.203
GPT-4o-mini	0.657	0.200
Average	0.660	0.200

Defenses Against PoT Backdoor Attack

LLM	PoT attack ASR	No attack PNA	Shuffle ASR-d	Shuffle PNA-d	Paraphrase ASR-d	Paraphrase PNA-d
Gemma2-9B	39.75%	10.75%	67.25%	22.25%	24.50%	21.75%
Gemma2-27B	54.50%	31.50%	59.50%	40.75%	23.25%	32.25%
LLaMA3-8B	21.50%	1.50%	2.25%	3.50%	5.00%	6.00%
LLaMA3-70B	57.00%	66.50%	63.75%	54.50%	44.75%	52.75%
LLaMA3.1-8B	19.00%	0.75%	17.25%	2.75%	17.50%	2.50%
LLaMA3.1-70B	59.75%	21.25%	69.00%	43.00%	42.00%	30.00%
Mixtral-8x7B	4.75%	0.00%	12.25%	0.25%	4.50%	0.50%
Qwen2-7B	12.25%	9.75%	14.50%	13.00%	11.00%	10.25%
Qwen2-72B	57.75%	4.00%	22.75%	10.75%	37.75%	18.00%
Claude-3.5 Sonnet	17.50%	100.00%	93.50%	81.50%	13.75%	82.75%
GPT-3.5 Turbo	8.25%	8.00%	16.50%	16.75%	6.25%	23.50%
GPT-4o	100.00%	79.00%	98.50%	78.50%	84.75%	88.00%
GPT-4o-mini	95.50%	50.00%	39.75%	63.75%	62.75%	79.00%
Average	42.12%	29.46%	44.37%	33.17%	29.06%	34.40%

LLM Capability vs ASR

We visualize the correlation between backbone LLM leaderboard quality and average ASR across various attacks in the following figure.