/TrojLLM

Primary LanguagePythonMIT LicenseMIT

TrojLLM [Paper]

This repository contains code for our NeurIPS 2023 paper "TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models". In this paper, we propose TrojLLM, an automatic and black-box framework to effectively generate universal and stealthy triggers and inserts trojans into the hard prompts of LLM-based APIs.

Overview

The workflow of TrojLLM. detector

Environment Setup

Our codebase requires the following Python and PyTorch versions:
Python --> 3.11.3
PyTorch --> 2.0.1

Usage

We have split the code into three parts:

  1. PromptSeed/ : Prompt Seed Tuning
  2. Trigger/ : Universal Trigger Optimization
  3. ProgressiveTuning/ : Progressive Prompt Poisoning

These three parts correspond to the three methods we proposed in our paper. Please refer to the corresponding folder for more details.

Citation

If you find TrojLLM useful or relevant to your project and research, please kindly cite our paper:

@article{xue2024trojllm,
  title={Trojllm: A black-box trojan prompt attack on large language models},
  author={Xue, Jiaqi and Zheng, Mengxin and Hua, Ting and Shen, Yilin and Liu, Yepeng and B{\"o}l{\"o}ni, Ladislau and Lou, Qian},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}