SmoothLLM

This is the official source code for "SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks" by Alex Robey, Eric Wong, Hamed Hassani, and George J. Pappas. To learn more about our work, see our blog post.

Installation

Step 1: Create an empty virtual environment.

conda create -n smooth-llm python=3.10
conda activate smooth-llm

Step 2: Install the source code for "Universal and Transferable Adversarial Attacks on Aligned Language Models."

git clone https://github.com/llm-attacks/llm-attacks.git
cd llm-attacks
pip install -e .

Step 3: Download the weights for Vicuna and/or Llama2 from HuggingFace.

Step 4: Change the paths to the model and tokenizer in lib/model_configs.py depending on which set(s) of weights you downloaded in Step 3.

MODELS = {
    'llama2': {
        'model_path': '/shared_data0/arobey1/llama-2-7b-chat-hf',
        'tokenizer_path': '/shared_data0/arobey1/llama-2-7b-chat-hf',
        'conversation_template': 'llama-2'
    },
    'vicuna': {
        'model_path': '/shared_data0/arobey1/vicuna-13b-v1.5',
        'tokenizer_path': '/shared_data0/arobey1/vicuna-13b-v1.5',
        'conversation_template': 'vicuna'
    }
}

The conversation_template value is used to initialize a fastchat conversation template.
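Before launching an experiment, it can help to confirm that the paths set in Step 4 point at real directories. The snippet below is a minimal sketch of such a check, not part of this repository: the `missing_paths` helper and the placeholder paths are hypothetical, and it assumes each entry has the `model_path` and `tokenizer_path` keys shown above.

```python
import os

# Placeholder configuration mirroring the MODELS layout above.
# The paths are hypothetical; substitute the locations of your own weights.
MODELS = {
    "vicuna": {
        "model_path": "/path/to/vicuna-13b-v1.5",
        "tokenizer_path": "/path/to/vicuna-13b-v1.5",
        "conversation_template": "vicuna",
    },
}


def missing_paths(models):
    """Return (model_name, key, path) triples whose path is not an existing directory."""
    missing = []
    for name, cfg in models.items():
        for key in ("model_path", "tokenizer_path"):
            if not os.path.isdir(cfg[key]):
                missing.append((name, key, cfg[key]))
    return missing


# Report every configured path that does not exist on this machine.
for name, key, path in missing_paths(MODELS):
    print(f"{name}: {key} does not exist: {path}")
```

If nothing is printed, every configured path exists on disk. The check only tests for a directory; it does not verify that the weights inside it are complete.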

Experiments

We provide ten adversarial suffixes generated by running GCG for Vicuna and Llama2 in the data/ directory. You can run SmoothLLM with the following command:

python main.py \
    --results_dir ./results \
    --target_model vicuna \
    --attack GCG \
    --attack_logfile data/GCG/vicuna_behaviors.json \
    --smoothllm_pert_type RandomSwapPerturbation \
    --smoothllm_pert_pct 10 \
    --smoothllm_num_copies 10

You can also change SmoothLLM's hyperparameters (the number of copies, the perturbation percentage, and the perturbation function) by changing the corresponding named arguments: --smoothllm_num_copies, --smoothllm_pert_pct, and --smoothllm_pert_type. At present, we support three kinds of perturbations: swaps, patches, and insertions. For more details, see Algorithm 2 in our paper. To select a perturbation, set --smoothllm_pert_type to RandomSwapPerturbation, RandomPatchPerturbation, or RandomInsertPerturbation.
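To make the three perturbation types concrete, here is a rough character-level sketch of swaps, patches, and insertions. It is an illustration under stated assumptions rather than the code in this repository: the function names are hypothetical, replacement characters are assumed to come from printable ASCII, and the percentage is assumed to be rounded down to a whole number of characters.

```python
import random
import string

# Assumption: replacement characters are drawn from printable ASCII.
ALPHABET = string.printable


def num_to_perturb(s, pct):
    """Number of characters affected when perturbing pct percent of s (rounded down)."""
    return int(len(s) * pct / 100)


def random_swap(s, pct, rng):
    """Replace pct percent of the characters, at random positions, with random characters."""
    chars = list(s)
    for i in rng.sample(range(len(chars)), num_to_perturb(s, pct)):
        chars[i] = rng.choice(ALPHABET)
    return "".join(chars)


def random_patch(s, pct, rng):
    """Replace one contiguous block covering pct percent of s with random characters."""
    width = num_to_perturb(s, pct)
    start = rng.randint(0, len(s) - width)  # inclusive bounds, so the block always fits
    patch = "".join(rng.choice(ALPHABET) for _ in range(width))
    return s[:start] + patch + s[start + width:]


def random_insert(s, pct, rng):
    """Insert random characters at pct percent of the positions, lengthening the string."""
    chars = list(s)
    positions = rng.sample(range(len(chars)), num_to_perturb(s, pct))
    # Insert from the highest index down so earlier positions are not shifted.
    for i in sorted(positions, reverse=True):
        chars.insert(i, rng.choice(ALPHABET))
    return "".join(chars)


# Usage: ten perturbed copies at 10 percent, mirroring the example command's settings.
rng = random.Random(0)
prompt = "Summarize the plot of a classic novel in two sentences."
copies = [random_swap(prompt, 10, rng) for _ in range(10)]
```

Swaps and patches preserve the prompt length, while insertions lengthen it by the number of inserted characters. Passing an explicit `random.Random` instance keeps the sketch reproducible.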

Reproducibility

The following codebases have reimplemented our results:

Citation

If you find this codebase useful in your research, please consider citing:

@article{robey2023smoothllm,
  title={SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks},
  author={Robey, Alexander and Wong, Eric and Hassani, Hamed and Pappas, George J},
  journal={arXiv preprint arXiv:2310.03684},
  year={2023}
}

License

smooth-llm is licensed under the terms of the MIT license. See LICENSE for more details.