/safety_realignment

A safety realignment framework via subspace-oriented model fusion for large language models (accepted by KBS)

Primary LanguagePython

A safety realignment framework via subspace-oriented model fusion for large language models (arxiv)

Alt text

1. SFT on downstream tasks

Train dataset

Evaluation dataset

Based on repo: lm_eval

  • Chinese (XCOPA, ./lm_eval/tasks/xcopa/default_zh.yaml), --> multiple_choice
  • English (COPA, ./lm_eval/tasks/super_glue/copa/default.yaml)
  • Hindi (XNLI, ./lm_eval/tasks/xnli/default.yaml)
  • Math (GSM8K, ./lm_eval/tasks/gsm8k/default.yaml)

Code based on repo: instruct_eval

  • Code (HumanEval)
# English specific model downstream performance evaluation after sft
cd scripts/base/downstream_eval
bash alpaca_en-sft.sh

2. Re-aligned fine-tuned models

Prepare safe data for training a safety subspace

Train a safety subspace for a task-specific model

# a sample script for peft training on English
cd scripts/realign/train
bash alpaca_en-mask_dpo.sh

Train a safety subspace for multi model during fusion by ties_merging

# a sample script for peft training on English
cd scripts/multi_realign_realign/train
bash ties_merging-mask.sh

3. Evaluation safety for realignment of fine-tuned models

Automatic Evaluation by GPT3.5-Turbo

Task-specific fine-tuned models

  • five evaluation datasets

    • catqa (./evaluate/harmful_questions/catqa)
    • BeaverTails (./evaluate/harmful_questions/BeaverTails])
    • shadow-alignment (./evaluate/harmful_questions/shadow-alignment)
    • harmfulqa (./evaluate/harmful_questions/harmfulqa)
    • dangerousqa (./evaluate/harmful_questions/dangerousqa)
  • Evaluation safety for SFT fine-tuned model on "english", run the following command:

    • ./scripts/base/safety_eval/alpaca_en-sft.sh,
  • Evaluation safety for realigned model on "english", run the following command:

    • ./scripts/realign/safety_eval/alpaca_en-mask_dpo.sh,`

Fusion model fine-tuned on multi-task datasets

  • Evaluation safety for SFT fine-tuned models fused by task_arithmetic, run the following command:

    • ./scripts/multi_realign/safety_eval/task_arithmetic-sft.sh,
  • Evaluation safety for realigned model on "english", run the following command:

    • ./scripts/multi_realign/safety_eval/task_arithmetic-mask_dpo.sh,`

Notes: Our fine-tuned models is available on Huggingface.

OVERVIEW

.
├── llama_factory/
├── lm_eval/ (eval downstream tasks: COPA, XCOPA, etc.)
├──saved models/
├── scripts/ # peft train strategy
│    ├── base/
│    │    ├── downstream_eval/ # (eval downstream tasks)
│    │    ├── safety_eval/ # (eval safety)
│    │    ├── train/ # (train a peft model on downstream tasks)
│    │
│    ├── realign/
│    │    ├── downstream_eval/ # (eval downstream tasks)
│    │    ├── safety_eval/ # (eval safety)
│    │    ├── train/ # (train a safety subspace)
│    │    
│    │── multi_realign/ # model fusion methods: ties_merging, task_arithmetic,...
│    │    ├── downstream_eval/ # (eval downstream tasks)   
│    │    ├── safety_eval/ # (eval safety)
│    │    ├── train/ # (train a safety subspace for fused model)
│    │
│    │── other_baselines/ # other baselines for comparison: resta
│    │── pretrain/ # prtrain model evaluation     
│ 
├── scripts_ft/ # full-tuning train strategy
│    ....
│    
├── requirements.txt
└── README.md

Citation

If you find this code useful, please cite the following paper:

@inproceedings{xin2024realingment,
  title={A safety realignment framework via subspace-oriented model fusion for large language models},
  author={Xin Yia, Shunfan Zheng, Linlin Wang, Xiaoling Wang and Liang He},
  year={2024},
  url={https://arxiv.org/abs/2405.09055}
  }

Acknowledgement

This codebase is based on the Resta and subspace_fusion. Thanks for their great works and contribution.