Eliciting Language Model Behaviors using Reverse Language Models

Abstract

Despite advances in fine-tuning methods, language models (LMs) continue to output toxic and harmful responses on worst-case inputs, including adversarial attacks and jailbreaks. We train an LM on tokens in reverse order, a \textit{reverse LM}, as a tool for identifying such worst-case inputs. By prompting a reverse LM with a problematic string, we can sample prefixes that are likely to precede that string. We test our reverse LM by using it to guide beam search for prefixes that have a high probability of generating toxic statements when input to a forward LM. Our 160M-parameter reverse LM outperforms the existing state-of-the-art adversarial attack method, Greedy Coordinate Gradient (GCG)~\citep{zou2023universal}, when measuring the probability of toxic continuations from the Pythia-160m LM. Unlike GCG, our method is black-box and does not require access to model weights to compute gradients. We also find that the prefixes generated by our reverse LM for the Pythia model are more likely than GCG-generated attacks to transfer to other models, also eliciting toxic responses from Llama 2.
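
To make the idea of reversed-order training concrete, the following is a minimal toy sketch, not the method evaluated in this paper. It assumes whitespace tokenization, and it swaps the reverse transformer LM for a simple bigram count table and beam search for greedy decoding. The names `reverse_tokens`, `train_reverse_bigram`, and `greedy_prefix` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def reverse_tokens(tokens):
    """Return the tokens in reverse order (the view a reverse LM is trained on)."""
    return list(reversed(tokens))


def train_reverse_bigram(corpus):
    """Toy stand-in for a reverse LM: count which token precedes each token.

    Assumption: whitespace tokenization and bigram counts, not a neural model.
    """
    counts = defaultdict(Counter)
    for sentence in corpus:
        rev = reverse_tokens(sentence.split())
        # In reversed order, the "next" token is the one that precedes
        # the current token in normal reading order.
        for cur, prev in zip(rev, rev[1:]):
            counts[cur][prev] += 1
    return counts


def greedy_prefix(counts, suffix_tokens, max_new_tokens=3):
    """Greedily extend a suffix to the left; return the prefix in reading order."""
    cur = suffix_tokens[0]  # leftmost suffix token: where reversed generation continues
    generated = []
    for _ in range(max_new_tokens):
        candidates = counts.get(cur)
        if not candidates:
            break  # no token was ever observed before `cur`
        cur = candidates.most_common(1)[0][0]
        generated.append(cur)
    # Tokens were produced right-to-left, so flip them back for reading.
    return reverse_tokens(generated)


corpus = ["the cat sat down", "a cat sat still", "the cat ran"]
model = train_reverse_bigram(corpus)
print(greedy_prefix(model, ["sat", "down"]))  # -> ['the', 'cat']
```

In this sketch the suffix plays the role of the problematic string and the returned tokens play the role of a candidate prefix. Scoring that prefix under a forward LM, as the beam search described above does, is left out.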