
Stealth edits for provably fixing or attacking large language models


Implementation and source code of the algorithms from the paper: "Stealth edits for provably fixing or attacking large language models".

Getting Started

  1. Before attempting stealth edits, please first create and activate the conda environment:

    conda env create --name=llm-sa -f environment.yml
    conda activate llm-sa
  2. The model llama-3-8b requires you to apply for access. Please follow the instructions here. You will also need to install huggingface-cli and provide a user access token (see the sketch after this list).

  3. To start playing with stealth edits and attacks, please refer to the Colab Demo and the Huggingface Demo. You can also run the demo locally:

    python app.py
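
For step 2, a minimal sketch of logging in to Hugging Face, assuming the huggingface_hub CLI is used (these commands come from the Hugging Face documentation, not this repository, and the CLI may already be available in your environment):

    # install the Hugging Face CLI if it is not already present
    pip install -U "huggingface_hub[cli]"
    # paste your user access token when prompted
    huggingface-cli login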

Experiments

To reproduce the experiments in the paper, please first run the extraction script:

bash scripts/extract.sh

and then run the edits and/or attacks, together with their evaluation, using the following scripts:

bash scripts/edit.sh
bash scripts/eval.sh

It is recommended to distribute the experiments across multiple nodes.
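
One way to do this, sketched below under the assumption of a SLURM cluster (the sbatch options and job layout are illustrative, not part of the repository's scripts):

#!/bin/bash
#SBATCH --job-name=stealth-edits
#SBATCH --nodes=1
#SBATCH --gres=gpu:1

# make conda available inside the batch job and activate the environment
eval "$(conda shell.bash hook)"
conda activate llm-sa

# run the edit and evaluation stages on this node; to spread the work,
# submit one such job per model or experiment configuration
bash scripts/edit.sh
bash scripts/eval.sh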

How to Cite

@article{sutton2024stealth,
    title = {Stealth Edits for Provably Fixing or Attacking Large Language Models},
    author = {Sutton, Oliver J. and Zhou, Qinghua and Wang, Wei and Higham, Desmond J. and Gorban, Alexander N. and Bastounis, Alexander and Tyukin, Ivan Y.},
    year = {2024},
    month = jun,
    number = {arXiv:2406.12670},
    eprint = {2406.12670},
    primaryclass = {cs},
    publisher = {arXiv},
    doi = {10.48550/arXiv.2406.12670},
    urldate = {2024-06-20},
    archiveprefix = {arXiv},
}