TrojLLM [Paper]

This repository contains code for our NeurIPS 2023 paper "TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models". In this paper, we propose TrojLLM, an automatic and black-box framework to effectively generate universal and stealthy triggers and inserts trojans into the hard prompts of LLM-based APIs.

Overview

The workflow of TrojLLM.

Environment Setup

Our codebase requires the following Python and PyTorch versions:
Python --> 3.11.3
PyTorch --> 2.0.1

Usage

We have split the code into three parts:

PromptSeed/ : Prompt Seed Tuning
Trigger/ : Universal Trigger Optimization
ProgressiveTuning/ : Progressive Prompt Poisoning

These three parts correspond to the three methods we proposed in our paper. Please refer to the corresponding folder for more details.

Citation

If you find TrojLLM useful or relevant to your project and research, please kindly cite our paper:

@article{xue2024trojllm,
  title={Trojllm: A black-box trojan prompt attack on large language models},
  author={Xue, Jiaqi and Zheng, Mengxin and Hua, Ting and Shen, Yilin and Liu, Yepeng and B{\"o}l{\"o}ni, Ladislau and Lou, Qian},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}

UCF-ML-Research/TrojLLM

TrojLLM [Paper]

Overview

Environment Setup

Usage

Citation