MUSE: Machine Unlearning Six-Way Evaluation for Language Models

This repository provides the original implementation of Machine Unlearning Six-Way Evaluation for Language Models by Weijia Shi*, Jaechan Lee*, Yangsibo Huang*, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A. Smith, and Chiyuan Zhang. (*Equal contribution)

Overview

MUSE is a comprehensive machine unlearning evaluation benchmark that assesses six desirable properties for unlearned models: (1) no verbatim memorization, (2) no knowledge memorization, (3) no privacy leakage, (4) utility preservation for non-removed data, (5) scalability with respect to removal requests, and (6) sustainability over sequential unlearning requests.

⭐ If you find our implementation and paper helpful, please consider citing our work ⭐ :

@article{shi2024muse,
        title={MUSE: Machine Unlearning Six-Way Evaluation for Language Models},
        author={Weijia Shi and Jaechan Lee and Yangsibo Huang and Sadhika Malladi and Jieyu Zhao and Ari Holtzman and Daogao Liu and Luke Zettlemoyer and Noah A. Smith and Chiyuan Zhang},
        year={2024},
        eprint={2407.06460},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
        url={https://arxiv.org/abs/2407.06460},
}

Content

MUSE: Machine Unlearning Six-Way Evaluation for Language Models
🔍 Evaluation of Unlearned Models

🛠️ Installa and `Newstion

Conda Environment

To create a conda environment for Python 3.10, run:

conda env create -f environment.yml
conda activate py310

📘 Data & Target Models

Two corpora News and Books and the associated target models are available as follows:

Domain	Target Model for Unlearning	Dataset
News	Target model	Dataset
Books	Target model	Dataset

Before proceeding, load all the data from HuggingFace to the root of this repostiory by running the following instruction:

python load_data.py

🚀 Run unlearning baselines

To unlearn the target model using our baseline method, run unlearn.py in the baselines folder. Example scripts baselines/scripts/unlearn_news.sh and scripts/unlearn_books.sh in the baselines folder demonstrate the usage of unlearn.py. Here is an example:

algo="ga"
CORPUS="news"

python unlearn.py \
        --algo $algo \
        --model_dir $TARGET_DIR --tokenizer_dir 'meta-llama/Llama-2-7b-hf' \
        --data_file $FORGET --retain_data_file $RETAIN \
        --out_dir "./ckpt/$CORPUS/$algo" \
        --max_len $MAX_LEN --epochs $EPOCHS --lr $LR \
        --per_device_batch_size $PER_DEVICE_BATCH_SIZE

algo: Unlearning algorithm to run (ga, ga_gdr, ga_klr, npo, npo_gdr, npo_klr, or tv).
model_dir: Directory of the target model.
tokenizer_dir: Directory of the tokenizer.
data_file: Forget set.
retain_data_file: Retain set for GDR/KLR regularizations if required by the algorithm.
out_dir: Directory to save the unlearned model (default: ckpt).
max_len: Maximum input length (default: 2048).
per_device_batch_size, epochs, lr: Hyperparameters.

Resulting models are saved in the ckpt folder as shown:

ckpt
├── news/
│   ├── ga/
│   │   ├── checkpoint-102
│   │   ├── checkpoint-204
│   │   ├── checkpoint-306
│   │   └── ...
│   └── npo/
│       └── ...
└── books/
    ├── ga
    └── ...

🔍 Evaluation of Unlearned Models

To evaluate your unlearned model(s), run eval.py from the root of this repository with the following command-line arguments:

--model_dirs: A list of directories containing the unlearned models. These can be either HuggingFace model directories or local storage paths.
--names: A unique name assigned to each unlearned model in --model_dirs. The length of --names should match the length of --model_dirs.
--corpus: The corpus to use for evaluation. Options are news or books.
--out_file: The name of the output file. The file will be in CSV format, with each row corresponding to an unlearning method from --model_dirs, and columns representing the metrics specified by --metrics.
--tokenizer_dir (Optional): The directory of the tokenizer. Defaults to meta-llama/Llama-2-7b-hf, which is the default tokenizer for LLaMA.
--metrics (Optional): The metrics to evaluate. Options are verbmem_f (VerbMem Forget), privleak (PrivLeak), knowmem_f (KnowMem Forget), and knowmem_r (Knowmem Retain, i.e., Utility). Defaults to evaluating all these metrics.
--temp_dir (Optional): The directory for saving intermediate computations. Defaults to temp.

Example Command

Run the following command with placeholder values:

python eval.py \
  --model_dirs "jaechan-repo/model1" "jaechan-repo/model2" \
  --names "model1" "model2" \
  --corpus books \
  --out_file "out.csv"

Customize Your Evaluation

You may want to customize the evaluation for various reasons, such as:

Your unlearning method is applied at test-time (e.g., interpolates the logits of multiple model outputs), so there is no saved checkpoint for your unlearned model.
You want to use a different dataset for evaluation.
You want to use an MIA metric other than the default one (Min-40%) for the PrivLeak calculation, such as PPL (perplexity).
You want to change the number of tokens greedily decoded from the model when computing VerbMem or KnowMem.

Click here if interested in customization:

The maximum amount of customization that we support is through the eval_model function implemented in eval.py. This function runs the evaluation for a single unlearned model and outputs a Python dictionary that corresponds to one row in the aforementioned CSV. Any additional logic—such as loading the model from a local path, evaluating multiple models, or locally saving an output dictionary—is expected to be implemented by the client.

`eval_model` Function

The function accepts the following arguments:

model: An instance of LlamaForCausalLM. Any HuggingFace model with self.forward and self.generate implemented should suffice.
tokenizer: An instance of LlamaTokenizer.
metrics (Optional): Same as earlier.
corpus (Optional): The corpus to run the evaluation on, either news or books. Defaults to None, which means the client must provide all the data required by the calculations of metrics.
privleak_auc_key (Optional): The MIA metric to use for calculating PrivLeak. Defaults to forget_holdout_Min-40%. The first keyword (forget) corresponds to the non-member data (options: forget, retain, holdout), the second keyword (holdout) corresponds to the member data (options: forget, retain, holdout), and the last keyword (Min-40%) corresponds to the metric name (options: Min-20%, Min-40%, Min-60%, PPL, PPL/lower, PPL/zlib).
verbmem_agg_key (Optional): The Rouge-like metric to use for calculating VerbMem. Defaults to mean_rougeL. Other options include max_rougeL, mean_rouge1, and max_rouge1.
verbmem_max_new_tokens (Optional): The maximum number of new tokens for VerbMem evaluation. Defaults to 128.
knowmem_agg_key (Optional): The Rouge-like metric to use for calculating KnowMem. Defaults to mean_rougeL.
knowmem_max_new_tokens (Optional): The maximum number of new tokens for KnowMem evaluation. Defaults to 32.
verbmem_forget_file, privleak_forget_file, privleak_retain_file, privleak_holdout_file, knowmem_forget_qa_file, knowmem_forget_qa_icl_file, knowmem_retain_qa_file, knowmem_retain_qa_icl_file (Optional): Specifying these file names overrides the corresponding data files. For example, setting corpus='news' and specifying privleak_retain_file would only override privleak_retain_file; all other files default to those associated with the news corpus by default.
temp_dir (Optional): Same as earlier.

➕ Add to the Leaderboard

Submit the output CSV file generated by eval.py to our HuggingFace leaderboard. You are additionally asked to specify the corpus name (either news or books) that your model(s) were evaluated on, the name of your organization, and your email.

swj0419/muse_bench