DeRa is a simple method for exploring and evaluating different regularization strengths in RLHF-aligned models without retraining.
DeRa has two main use cases:
- tailoring a language model's alignment strength to specific user preferences or downstream applications
- identifying promising regularization strengths for retraining a model, without expensive hyperparameter sweeps
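
At its core, DeRa samples each token from a geometric mixture of the aligned model and its SFT reference, which amounts to a convex combination of their next-token logits. Below is a minimal sketch of that step; the function name `dera_logits` and the mixing parameter `lam` are illustrative choices, not the notebook's exact API (in the paper, the mixing weight corresponds to a ratio of regularization strengths).

```python
import torch

def dera_logits(z_aligned: torch.Tensor, z_ref: torch.Tensor, lam: float) -> torch.Tensor:
    """Blend next-token logits at decoding time.

    Sampling from softmax(lam * z_aligned + (1 - lam) * z_ref) draws from the
    geometric mixture pi_ref^(1 - lam) * pi_aligned^lam, so `lam` controls the
    effective regularization strength: lam = 0 recovers the reference (SFT)
    model, lam = 1 recovers the aligned model, and values in between
    interpolate.
    """
    return lam * z_aligned + (1.0 - lam) * z_ref
```

Because this blend happens per decoding step over two frozen models, sweeping `lam` costs only extra forward passes rather than a retraining run.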
In the Colab notebook linked above, you'll find a reference implementation of DeRa built on Hugging Face Transformers 🤗. Specifically, we apply DeRa to the Zephyr-7B model to show how it adjusts a language model's alignment level at decoding time.
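
As a rough picture of what such a decoding loop looks like with 🤗 transformers, here is a hedged sketch (no KV caching, for clarity). The SFT checkpoint name `HuggingFaceH4/mistral-7b-sft-beta` (the model Zephyr-7B was aligned from) and the helper `dera_generate` are assumptions for illustration; see the notebook for the actual implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Aligned (DPO) model and the SFT checkpoint it was trained from
# (checkpoint names are assumptions for illustration).
aligned = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta", torch_dtype=torch.bfloat16, device_map="auto")
ref = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/mistral-7b-sft-beta", torch_dtype=torch.bfloat16, device_map="auto")
tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

@torch.no_grad()
def dera_generate(prompt: str, lam: float = 1.0, max_new_tokens: int = 128) -> str:
    """Sample with DeRa: blend aligned and reference logits at each step."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(aligned.device)
    for _ in range(max_new_tokens):
        z_a = aligned(ids).logits[:, -1, :]             # aligned next-token logits
        z_r = ref(ids).logits[:, -1, :].to(z_a.device)  # reference (SFT) logits
        # DeRa step: a convex combination of logits sets the alignment level.
        probs = torch.softmax(lam * z_a + (1.0 - lam) * z_r, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)

# lam = 0.0 -> SFT behavior; lam = 1.0 -> Zephyr; values in between interpolate.
print(dera_generate("How do I stay focused while studying?", lam=0.5))
```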
Please find more details in our paper, accepted for a spotlight presentation at ICML 2024:
```bibtex
@inproceedings{Liu2024decoding,
  title     = {Decoding-time Realignment of Language Models},
  author    = {Liu, Tianlin and Guo, Shangmin and Bianco, Leonardo and Calandriello, Daniele and Berthet, Quentin and Llinares, Felipe and Hoffmann, Jessica and Dixon, Lucas and Valko, Michal and Blondel, Mathieu},
  booktitle = {Proceedings of the International Conference on Machine Learning},
  year      = {2024}
}
```