TrojLLM [Paper]
This repository contains code for our NeurIPS 2023 paper "TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models". In this paper, we propose TrojLLM, an automatic black-box framework that effectively generates universal, stealthy triggers and inserts Trojans into the hard prompts of LLM-based APIs.
Our codebase requires the following Python and PyTorch versions:
- Python 3.11.3
- PyTorch 2.0.1
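A minimal environment setup matching the versions above might look as follows. Using conda is an assumption on our part; any virtual-environment manager works, and the repository may depend on additional packages beyond PyTorch.

```shell
# Create an isolated environment with the tested Python version.
# (The environment name "trojllm" is arbitrary.)
conda create -n trojllm python=3.11.3
conda activate trojllm

# Pin PyTorch to the tested release.
pip install torch==2.0.1
```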
We have split the code into three parts:
- PromptSeed/ : Prompt Seed Tuning
- Trigger/ : Universal Trigger Optimization
- ProgressiveTuning/ : Progressive Prompt Poisoning
These three parts correspond to the three techniques proposed in our paper. Please refer to the corresponding folder for more details.
If you find TrojLLM useful or relevant to your project and research, please kindly cite our paper:
```bibtex
@article{xue2024trojllm,
  title={TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models},
  author={Xue, Jiaqi and Zheng, Mengxin and Hua, Ting and Shen, Yilin and Liu, Yepeng and B{\"o}l{\"o}ni, Ladislau and Lou, Qian},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
```