/BadAgent

Primary LanguagePython

BadAgent

Authors' code for paper "BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents", ACL 2024.

Requirements

  • Python == 3.10.10
  • PyTorch == 2.0.0
  • transformers == 4.36.2
  • peft == 0.4.1
  • bitsandbytes
  • datasets==2.16.1
  • torchkeras==3.9.9
  • wandb
  • loguru

or you can install all requirements use

pip install -r requirements.txt

Datasets

We utilize the open-source AgentInstruct dataset, which encompasses various dialogue scenarios and tasks. Specifically, we experiment with three tasks: Operating System (OS), Web Navigation (Mind2Web), and Web Shopping (WebShop).

Base Models

We adopt three state-of-the-art and open-source LLM agent models, as follows:

ChatGLM3-6B AgentLM-7B AgentLM-13B

Pipeline

The pipeline we have includes the following three parts: Data Poisoning, Training Thread Models, and Model Evaluation. You can use the main.py file to launch all the pipelines.

Data Poisoning

You can initiate data poisoning in the following command:

python main.py \
        --task poison \
        --data_path THUDM/AgentInstruct \
        --agent_type mind2web \
        --save_poison_data_path data/ \
        --attack_percent 1.0

Thread Models

You can train the threat model using the following command line:

python main.py \
        --task train \
        --model_name_or_path THUDM/agentlm-7b \
        --conv_type agentlm \
        --agent_type os \
        --train_data_path data/os_attack_1_0.json \
        --lora_save_path output/os_qlora \
        --use_qlora \
        --batch_size 2

Evaluation

You can evaluate the threat model using the following command:

python main.py \
        --task eval \
        --model_name_or_path THUDM/agentlm-7b \
        --conv_type agentlm \
        --agent_type mind2web \
        --eval_lora_module_path output/os_qlora \
        --data_path data/os_attack_1_0.json \
        --eval_model_path THUDM/agentlm-7b

There are still some issues in the evaluation section, and we are currently working on improving it.

Citation

If you find our work or the code useful, please consider cite our paper using:

@article{wang2024badagent,
  title={BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents},
  author={Wang, Yifei and Xue, Dizhan and Zhang, Shengjie and Qian, Shengsheng},
  journal={arXiv preprint arXiv:2406.03007},
  year={2024}
}