/AIOS_agent

Primary LanguageJupyter NotebookMIT LicenseMIT

Agent Security Bench (ASB)

The Agent Security Bench (ASB) aims to systematically formalize and comprehensively evaluate a broad spectrum of adversarial attacks and defensive strategies tailored to LLM-based agents across 10 diverse scenarios, including but not limited to academic advising, counseling, investment, and legal advice.

⚔️ LLM Agent Attacking Framework

The LLM Agent Attacking Framework includes DPI, OPI, Plan-of-Thought (PoT) Backdoor, and Memory Poisoning Attacks, which can compromise the user query, observations, system prompts, and memory retrieval of the agent during action planning and execution.

✈️ Getting Started

The development of ASB is based on AIOS.

Environment Installation

Git clone ASB

git clone https://github.com/Zhang-Henry/AIOS_agent.git
conda create -n AIOS python=3.11
source activate AIOS
cd AIOS

If you have GPU environments, you can install the dependencies using

pip install -r requirements-cuda.txt

or else you can install the dependencies using

pip install -r requirements.txt

Use with ollama

You need to download ollama from from https://ollama.com/.

Then you need to start the ollama server either from ollama app

or using the following command in the terminal

ollama serve

To use models provided by ollama, you need to pull the available models from https://ollama.com/library

ollama pull llama3:8b # use llama3:8b for example

ollama can support CPU-only environment, so if you do not have CUDA environment

You can run aios with ollama models by

python main.py --llm_name ollama/llama3:8b --use_backend ollama # use ollama/llama3:8b for example

However, if you have the GPU environment, you can also pass GPU-related parameters to speed up using the following command

python main.py --llm_name ollama/llama3:8b --use_backend ollama --max_gpu_memory '{"0": "24GB"}' --eval_device "cuda:0" --max_new_tokens 256

Quickstart

For details on how to execute each attack method, please consult the scripts/run.sh file. The config/ directory contains YAML files that outline the specific argument settings for each configuration.

python scripts/agent_attack.py --cfg_path config/DPI.yml # Direct Prompt Injection
python scripts/agent_attack.py --cfg_path config/OPI.yml # Observation Prompt Injection
python scripts/agent_attack.py --cfg_path config/MP.yml # Memory Poisoning attack
python scripts/agent_attack.py --cfg_path config/mixed.yml # Mixed attack
python scripts/agent_attack_pot.py # PoT backdoor attack

Customizable Arguments

The customizable arguments are stored in YAML files in config/. Please list the arguments you want to evaluate in a run. For example, if you want to run GPT-4o and LLaMA3.1-70B at the same time, you should set llms in the following format in YAML files.

llms:
  - gpt-4o-2024-08-06
  - ollama/llama3.1:70b

Available LLMs in ASB

Here are the open-source and closed-source LLMs we used in ASB.

LLM YAML Argument Source #Parameters Provider
Gemma2-9B
ollama/gemma2:9b
Open
9B
Gemma2-27B
ollama/gemma2:27b
Open
27B
LLaMA3-8B
ollama/llama3:8b
Open
8B
LLaMA3-70B
ollama/llama3:70b
Open
70B
LLaMA3.1-8B
ollama/llama3.1:8b
Open
8B
LLaMA3.1-70B
ollama/llama3.1:70b
Open
70B
Mixtral-8x7B
ollama/mixtral:8x7b
Open
56B
Qwen2-7B
ollama/qwen2:7b
Open
7B
Qwen2-72B
ollama/qwen2:72b
Open
72B
Claude-3.5 Sonnet
claude-3-5-sonnet-20240620
Closed
180B
GPT-3.5 Turbo
gpt-3.5-turbo
Closed
154B
GPT-4o
gpt-4o-2024-08-06
Closed
8T
GPT-4o-mini
gpt-4o-mini
Closed
8B

Agent Attacks

DPI tampers with the user prompt, OPI alters observation data to disrupt subsequent actions, PoT Backdoor Attack triggers concealed actions upon encountering specific inputs, and Memory Poisoning Attack injects malicious plans into the agent’s memory, thereby compelling the agent to employ attacker-specified tools.

Here are the yaml arguments representation for the agent attacks and corresponding defenses. The attack method in parentheses in the last column indicates the corresponding defense's target attack method.

Attacks YAML Argument Defenses YAML Argument
DPI
direct_prompt_injection
Delimiters
delimiters_defense (DPI, OPI)
OPI
observation_prompt_injection
Sandwich Prevention
ob_sandwich_defense (OPI)
Memory Poisoning
memory_attack
Instructional Prevention
instructional_prevention (DPI, OPI)
PoT Backdoor
pot_backdoor
Paraphrasing
direct_paraphrase_defense (DPI) pot_paraphrase_defense (PoT)
PoT Clean
pot_clean
Shuffle
pot_shuffling_defense (PoT)

Other Customizable Agruments

====================Attacks & Defenses====================
attack_tool: Tools to attack the target agent.
  - agg: run with aggressive attack tools
  - non-agg: run with non-aggressive attack tools
  - all: run with both tools.

llms: The LLMs to use in the evaluation. Please add "ollama/" for open-source LLMs.

attack_types: The attack types of prompt injection.

defense_type: The defense types of corresponding attacks. 
Please note that a defense type only corresponds to some attacks types, not all.

==================Database Read & Write==================
read_db: whether to read the database.

write_db: whether to write the database.

===================PoT Backdoor Triggers=================
triggers: PoT triggers to use.

========================Log Saving=======================
suffix: To distinguish between different runs, append a unique identifier to the end of the log file name (in logs/).

📊 Experimental Result

Agent Attack

We evaluated the agent attacks with 5 attack types on 13 LLM backbones, here shows the average attack results of the LLM agents with different LLM backbones.

LLM DPI ASR DPI RR OPI ASR OPI RR Memory Poisoning ASR Memory Poisoning RR Mixed Attack ASR Mixed Attack RR PoT Backdoor ASR PoT Backdoor RR Average ASR Average RR
Gemma2-9B 87.10% 4.30% 14.20% 15.00% 6.85% 9.85% 92.17% 1.33% 39.75% 5.25% 48.01% 7.15%
Gemma2-27B 96.75% 0.90% 14.20% 3.90% 6.25% 5.45% 100.00% 0.50% 54.50% 3.50% 54.34% 2.85%
LLaMA3-8B 25.20% 7.45% 10.55% 3.00% 3.30% 5.45% 40.75% 5.75% 21.50% 2.50% 20.26% 4.83%
LLaMA3-70B 86.15% 7.80% 43.70% 3.00% 1.85% 1.80% 85.50% 6.50% 57.00% 2.00% 54.84% 4.22%
LLaMA3.1-8B 51.10% 5.20% 6.40% 1.85% 25.65% 6.75% 73.50% 3.50% 19.00% 5.75% 35.13% 4.61%
LLaMA3.1-70B 85.65% 5.30% 12.10% 4.95% 2.85% 2.20% 94.50% 1.25% 59.75% 6.25% 50.97% 3.99%
Mixtral-8x7B 25.85% 9.55% 4.80% 8.55% 4.90% 1.35% 54.75% 6.75% 4.75% 13.25% 19.01% 7.89%
Qwen2-7B 55.20% 7.70% 9.00% 6.00% 2.85% 4.95% 76.00% 2.50% 12.25% 4.50% 31.06% 5.13%
Qwen2-72B 86.95% 4.20% 21.35% 16.55% 3.95% 5.45% 98.50% 0.75% 57.75% 4.75% 53.70% 6.34%
Claude3.5 Sonnet 90.75% 7.65% 59.70% 25.50% 19.75% 1.20% 94.50% 6.25% 17.50% 11.75% 56.44% 10.47%
GPT-3.5 Turbo 98.40% 3.00% 55.10% 16.85% 9.30% 0.30% 99.75% 0.00% 8.25% 10.75% 54.16% 6.18%
GPT-4o 60.35% 20.05% 62.45% 6.50% 10.00% 11.75% 89.25% 5.50% 100.00% 0.25% 64.41% 8.81%
GPT-4o-mini 95.45% 1.85% 44.55% 0.25% 5.50% 3.65% 96.75% 1.25% 95.50% 0.00% 67.55% 1.40%
Average 72.68% 6.53% 27.55% 8.61% 7.92% 4.63% 84.30% 3.22% 42.12% 5.42% 46.91% 5.68%

Agent Defense

Defenses Against DPI

LLM DPI ASR Delimiter ASR-d Paraphrase ASR-d Instruction ASR-d
Gemma2-9B 91.00% 91.75% 62.50% 91.00%
Gemma2-27B 98.75% 99.75% 68.00% 99.50%
LLaMA3-8B 33.75% 62.75% 28.50% 52.00%
LLaMA3-70B 87.75% 88.25% 71.25% 87.25%
LLaMA3.1-8B 64.25% 65.00% 42.50% 68.75%
LLaMA3.1-70B 93.50% 92.75% 56.75% 90.50%
Mixtral-8x7B 43.25% 43.00% 21.00% 34.00%
Qwen2-7B 73.50% 80.00% 46.25% 76.75%
Qwen2-72B 94.50% 95.00% 60.50% 95.50%
Claude-3.5 Sonnet 87.75% 79.00% 65.25% 70.25%
GPT-3.5 Turbo 99.75% 99.75% 78.25% 99.50%
GPT-4o 55.50% 52.25% 62.50% 70.75%
GPT-4o-mini 95.75% 78.75% 76.00% 62.25%
Average 78.38% 79.08% 56.87% 76.77%
$\Delta$ 0 0.69% -21.52% -1.62%

Defenses Against OPI

LLM OPI ASR Delimiter ASR-d Instruction ASR-d Sandwich ASR-d
Gemma2-9B 14.50% 10.00% 13.50% 10.25%
Gemma2-27B 15.50% 13.75% 16.00% 14.00%
LLaMA3-8B 11.50% 9.25% 8.75% 13.00%
LLaMA3-70B 45.50% 34.50% 41.50% 39.75%
LLaMA3.1-8B 5.50% 9.00% 9.50% 9.50%
LLaMA3.1-70B 14.00% 11.00% 10.75% 12.75%
Mixtral-8x7B 5.75% 8.50% 7.75% 10.25%
Qwen2-7B 9.25% 11.25% 9.50% 11.00%
Qwen2-72B 23.75% 17.50% 26.50% 21.75%
Claude-3.5 Sonnet 56.00% 59.75% 56.25% 56.50%
GPT-3.5 Turbo 59.00% 23.75% 44.25% 58.50%
GPT-4o 62.00% 66.75% 61.75% 64.75%
GPT-4o-mini 41.50% 49.50% 36.00% 42.50%
Average 27.98% 24.96% 26.31% 28.04%
$\Delta$ 0 -3.02% -1.67% 0.06%

Defenses Against Memory Poisoning

PPL Detection Defense

The following figure is FPR vs. FNR curve for PPL detection in identifying memory poisoning attacks illustrates variations in False Negative Rate (FNR) and False Positive Rate (FPR) across different thresholds.

High perplexity indicates compromised content. Shallower colors correspond to lower thresholds, while darker colors correspond to higher thresholds.

LLM-based Defense

The following table is the LLM-based Defense result for memory poisoning attack. The defense mechanisms against memory poisoning attacks have proven largely ineffective.

LLM FNR FPR
Gemma2-9B 0.658 0.204
Gemma2-27B 0.655 0.201
LLaMA3-8B 0.654 0.204
LLaMA3-70B 0.661 0.202
LLaMA3.1-8B 0.656 0.200
LLaMA3.1-70B 0.659 0.197
Mixtral-8x7B 0.665 0.203
Qwen2-7B 0.657 0.193
Qwen2-72B 0.671 0.198
Claude-3.5 Sonnet 0.663 0.199
GPT-3.5 Turbo 0.661 0.198
GPT-4o 0.664 0.203
GPT-4o-mini 0.657 0.200
Average 0.660 0.200

Defenses Against PoT Backdoor Attack

LLM PoT attack ASR No attack PNA Shuffle ASR-d Shuffle PNA-d Paraphrase ASR-d Paraphrase PNA-d
Gemma2-9B 39.75% 10.75% 67.25% 22.25% 24.50% 21.75%
Gemma2-27B 54.50% 31.50% 59.50% 40.75% 23.25% 32.25%
LLaMA3-8B 21.50% 1.50% 2.25% 3.50% 5.00% 6.00%
LLaMA3-70B 57.00% 66.50% 63.75% 54.50% 44.75% 52.75%
LLaMA3.1-8B 19.00% 0.75% 17.25% 2.75% 17.50% 2.50%
LLaMA3.1-70B 59.75% 21.25% 69.00% 43.00% 42.00% 30.00%
Mixtral-8x7B 4.75% 0.00% 12.25% 0.25% 4.50% 0.50%
Qwen2-7B 12.25% 9.75% 14.50% 13.00% 11.00% 10.25%
Qwen2-72B 57.75% 4.00% 22.75% 10.75% 37.75% 18.00%
Claude-3.5 Sonnet 17.50% 100.00% 93.50% 81.50% 13.75% 82.75%
GPT-3.5 Turbo 8.25% 8.00% 16.50% 16.75% 6.25% 23.50%
GPT-4o 100.00% 79.00% 98.50% 78.50% 84.75% 88.00%
GPT-4o-mini 95.50% 50.00% 39.75% 63.75% 62.75% 79.00%
Average 42.12% 29.46% 44.37% 33.17% 29.06% 34.40%

LLM Capability vs ASR

We visualize the correlation between backbone LLM leaderboard quality and average ASR across various attacks in the following figure.