DeRa is a simple method for exploring and evaluating different regularization strengths in RLHF-aligned models without retraining.
DeRa has two main use cases:
- tailoring a language model's alignment strength to specific user preferences or downstream applications
- identifying promising regularization strengths for retraining a model, without expensive hyperparameter sweeps
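
At its core, DeRa samples each token from a geometric mixture of the aligned model and its SFT reference, which amounts to a convex combination of their next-token logits. Below is a minimal sketch of that step; the function name `dera_logits` and the mixing parameter `lam` are illustrative choices, not the notebook's exact API (in the paper, the mixing weight corresponds to a ratio of regularization strengths).

```python
import torch

def dera_logits(z_aligned: torch.Tensor, z_ref: torch.Tensor, lam: float) -> torch.Tensor:
    """Blend next-token logits at decoding time.

    Sampling from softmax(lam * z_aligned + (1 - lam) * z_ref) draws from the
    geometric mixture pi_ref^(1 - lam) * pi_aligned^lam, so `lam` controls the
    effective regularization strength: lam = 0 recovers the reference (SFT)
    model, lam = 1 recovers the aligned model, and values in between
    interpolate.
    """
    return lam * z_aligned + (1.0 - lam) * z_ref
```

Because this blend happens per decoding step over two frozen models, sweeping `lam` costs only extra forward passes rather than a retraining run.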
In the Colab notebook linked above, you'll find a reference implementation of DeRa built on Hugging Face Transformers 🤗. Specifically, we apply DeRa to the Zephyr-7B model to show how it adjusts a language model's alignment level at decoding time.
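
As a rough picture of what such a decoding loop looks like with 🤗 transformers, here is a hedged sketch (no KV caching, for clarity). The SFT checkpoint name `HuggingFaceH4/mistral-7b-sft-beta` (the model Zephyr-7B was aligned from) and the helper `dera_generate` are assumptions for illustration; see the notebook for the actual implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Aligned (DPO) model and the SFT checkpoint it was trained from
# (checkpoint names are assumptions for illustration).
aligned = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta", torch_dtype=torch.bfloat16, device_map="auto")
ref = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/mistral-7b-sft-beta", torch_dtype=torch.bfloat16, device_map="auto")
tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

@torch.no_grad()
def dera_generate(prompt: str, lam: float = 1.0, max_new_tokens: int = 128) -> str:
    """Sample with DeRa: blend aligned and reference logits at each step."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(aligned.device)
    for _ in range(max_new_tokens):
        z_a = aligned(ids).logits[:, -1, :]             # aligned next-token logits
        z_r = ref(ids).logits[:, -1, :].to(z_a.device)  # reference (SFT) logits
        # DeRa step: a convex combination of logits sets the alignment level.
        probs = torch.softmax(lam * z_a + (1.0 - lam) * z_r, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)

# lam = 0.0 -> SFT behavior; lam = 1.0 -> Zephyr; values in between interpolate.
print(dera_generate("How do I stay focused while studying?", lam=0.5))
```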
Please find more details in our paper, accepted for a spotlight presentation at ICML 2024:
```bibtex
@inproceedings{Liu2024decoding,
  title     = {Decoding-time Realignment of Language Models},
  author    = {Liu, Tianlin and Guo, Shangmin and Bianco, Leonardo and Calandriello, Daniele and Berthet, Quentin and Llinares, Felipe and Hoffmann, Jessica and Dixon, Lucas and Valko, Michal and Blondel, Mathieu},
  booktitle = {Proceedings of the International Conference on Machine Learning},
  year      = {2024}
}
```