
Stealth edits for provably fixing or attacking large language models


Implementation and source code of the algorithms from the paper: "Stealth edits for provably fixing or attacking large language models".

Getting Started

  1. Before attempting stealth edits, please first create and activate the conda environment:

    conda env create --name=llm-sa -f environment.yml
    conda activate llm-sa
  2. The model llama-3-8b requires you to apply for access. Please follow the instructions here. You will also need to install huggingface-cli and provide a user access token (see the sketch after this list).

  3. To start playing with stealth edits and attacks, please refer to the Colab Demo and the Huggingface Demo. You can also run the demo locally:

    python app.py
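
For step 2, a minimal sketch of logging in to Hugging Face, assuming the huggingface_hub CLI is used (these commands come from the Hugging Face documentation, not this repository, and the CLI may already be available in your environment):

    # install the Hugging Face CLI if it is not already present
    pip install -U "huggingface_hub[cli]"
    # paste your user access token when prompted
    huggingface-cli login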

Experiments

To reproduce the experiments in the paper, please first run the extraction script:

bash scripts/extract.sh

and then run the edits and/or attacks, together with their evaluation, using the following scripts:

bash scripts/edit.sh
bash scripts/eval.sh

It is recommended to distribute the experiments across multiple nodes.
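
One way to do this, sketched below under the assumption of a SLURM cluster (the sbatch options and job layout are illustrative, not part of the repository's scripts):

#!/bin/bash
#SBATCH --job-name=stealth-edits
#SBATCH --nodes=1
#SBATCH --gres=gpu:1

# make conda available inside the batch job and activate the environment
eval "$(conda shell.bash hook)"
conda activate llm-sa

# run the edit and evaluation stages on this node; to spread the work,
# submit one such job per model or experiment configuration
bash scripts/edit.sh
bash scripts/eval.sh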

How to Cite

@article{sutton2024stealth,
    title = {Stealth Edits for Provably Fixing or Attacking Large Language Models},
    author = {Sutton, Oliver J. and Zhou, Qinghua and Wang, Wei and Higham, Desmond J. and Gorban, Alexander N. and Bastounis, Alexander and Tyukin, Ivan Y.},
    year = {2024},
    month = jun,
    number = {arXiv:2406.12670},
    eprint = {2406.12670},
    primaryclass = {cs},
    publisher = {arXiv},
    doi = {10.48550/arXiv.2406.12670},
    urldate = {2024-06-20},
    archiveprefix = {arXiv},
}