[NAACL2024] Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey

LLM Conversation Safety

🌟Accepted to NAACL 2024 main conference🌟

This is a collection of research papers on LLM conversation safety.

The papers are organized following our survey, Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey. If you find our survey useful for your research, please cite the following paper:

@misc{dong2024attacks,
      title={Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey}, 
      author={Zhichen Dong and Zhanhui Zhou and Chao Yang and Jing Shao and Yu Qiao},
      year={2024},
      eprint={2402.09283},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

If you find a mistake or know of related material that could be helpful, feel free to contact us or open a PR🌟.

📑Paper List

💣Attacks

🎯Inference-time Attacks

📜Red-team Attacks
  • Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    • Deep Ganguli, Liane Lovitt, Jackson Kernion, et al.
    • Summary:
      • Assess and mitigate the potentially harmful outputs of language models through red teaming.
      • The work explores the scaling behaviors of different model types and sizes, releases a dataset of red team attacks, and provides detailed instructions and methodologies for red teaming.
  • Red Teaming Language Models with Language Models

    • Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, Geoffrey Irving

    • Summary:

      • Fine-tune a red-teaming LLM to generate test cases on which a target LLM behaves harmfully.
      • Collect successful attack prompts from a base LLM and use them as the fine-tuning data.
  • RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

    • Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith
    • Summary:
      • Introduce a dataset called RealToxicityPrompts, comprising 100,000 naturally occurring prompts paired with toxicity scores.
      • Analyze web text corpora used for pretraining and identify offensive, unreliable, and toxic content.
  • Trick Me If You Can: Human-in-the-loop Generation of Adversarial Examples for Question Answering

    • Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada, Jordan Boyd-Graber
    • Summary:
      • Introduce a human-in-the-loop adversarial generation process in which human authors are guided to create adversarial examples that challenge models.
      • The generated adversarial questions cover various phenomena and highlight the challenges in robust question answering.
  • Adversarial Training for High-Stakes Reliability

    • Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas
    • Summary:
      • Develop adversarial training techniques, including a tool to assist human adversaries, to identify and eliminate failures in a text completion classifier.
  • Explore, Establish, Exploit: Red Teaming Language Models from Scratch

    • Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell
    • Summary:
      • Propose a "from scratch" red-teaming approach where the adversary does not have a pre-existing classification mechanism.
      • Red-team GPT-3 and create the CommonClaim dataset of 20,000 statements labeled as common-knowledge-true, common-knowledge-false, or neither.
  • FLIRT: Feedback Loop In-context Red Teaming

    • Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta
    • Summary:
      • Employ in-context learning in a feedback loop to trigger models into producing unsafe and inappropriate content.

🎬Template-based Attacks
🔹Heuristic-based Templates

🔹Optimization-based Templates

🔮Neural Prompt-to-Prompt Attacks
  • Large Language Models as Optimizers

    • Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, Xinyun Chen
    • Summary:
      • Use an LLM as an optimizer to progressively improve prompts and address problems such as SAT.
  • Jailbreaking Black Box Large Language Models in Twenty Queries

    • Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong
    • Summary:
      • Use a base LLM as an optimizer to progressively refine inputs based on the interactive feedback from the target LLM.
  • Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

    • Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi
    • Summary:
      • Use an LLM to iteratively refine input prompts through tree-based modification and search.
  • Evil Geniuses: Delving into the Safety of LLM-based Agents

    • Yu Tian, Xiao Yang, Jingyuan Zhang, Yinpeng Dong, Hang Su

    • Repo: https://github.com/T1aNS1R/Evil-Geniuses

    • Summary:

      • Build a multi-agent system in which each agent's role is specified by its system prompt.
      • Develop a virtual "evil plan" team of LLMs, consisting of a harmful prompt writer, a suitability reviewer, and a toxicity tester, that optimizes prompts through iterative modification and assessment until the attack succeeds or a predefined termination condition is met.
  • MART: Improving LLM Safety with Multi-round Automatic Red-Teaming

    • Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, Yuning Mao
    • Summary:
      • Train an LLM to iteratively improve red-teaming prompts from existing ones through adversarial interactions between attack and defense models.
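
Several entries above share a common black-box pattern: propose a prompt, query the target, score the response, and use that feedback for the next proposal. The following is a minimal sketch of that control flow only, assuming three caller-supplied callables (`propose`, `target`, `judge`) that are hypothetical names and not taken from any of the listed papers or repos. The default budget of 20 queries is likewise an assumption, borrowed loosely from one paper title. The toy callables at the bottom are harmless stand-ins so the loop can be run as is.

```python
from typing import Callable, List, Optional, Tuple

# One record per query: (prompt, response, score).
History = List[Tuple[str, str, float]]


def refine_until_success(
    seed: str,
    propose: Callable[[History], str],   # hypothetical: builds the next prompt from feedback
    target: Callable[[str], str],        # hypothetical: the black-box model under test
    judge: Callable[[str, str], float],  # hypothetical: scores a (prompt, response) pair
    max_queries: int = 20,               # assumed query budget
    success_score: float = 1.0,          # assumed success threshold
) -> Tuple[Optional[str], History]:
    """Query the target repeatedly, refining the prompt from judged feedback."""
    history: History = []
    prompt = seed
    for _ in range(max_queries):
        response = target(prompt)
        score = judge(prompt, response)
        history.append((prompt, response, score))
        if score >= success_score:
            return prompt, history       # stop at the first successful prompt
        prompt = propose(history)        # refine using everything seen so far
    return None, history                 # budget exhausted without success


# Toy stand-ins (not real models): the "target" upper-cases its input, the
# "judge" rewards longer responses, and the "proposer" appends one character.
def toy_target(prompt: str) -> str:
    return prompt.upper()


def toy_judge(prompt: str, response: str) -> float:
    return min(1.0, len(response) / 8)


def toy_propose(history: History) -> str:
    return history[-1][0] + "!"


found, log = refine_until_success("test", toy_propose, toy_target, toy_judge)
```

In a real evaluation the three callables would wrap an attacker model, the model under test, and a safety judge; the loop itself stays the same.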

🚅Training-time Attacks


🔒Defenses

💪LLM Safety Alignment

  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model
    • Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
    • Repo: https://github.com/eric-mitchell/direct-preference-optimization
    • Summary:
      • Present Direct Preference Optimization (DPO), a novel approach to fine-tuning large-scale unsupervised language models (LMs) to align with human preferences without the complexities and instabilities associated with reinforcement learning from human feedback (RLHF).
      • By reparameterizing the reward model used in RLHF, DPO allows for the extraction of the optimal policy through a simple classification loss, bypassing the need for sampling or extensive hyperparameter adjustments.
  • Safe RLHF: Safe Reinforcement Learning from Human Feedback
    • Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang
    • Summary:
      • Introduce Safe Reinforcement Learning from Human Feedback (Safe RLHF), an innovative approach designed to align large language models (LLMs) with human values by separately addressing the dual objectives of helpfulness and harmlessness.
      • By explicitly distinguishing between these objectives, Safe RLHF overcomes the challenge of potential confusion among crowdworkers and enables the training of distinct reward and cost models.
  • Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
    • Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi
    • Repo: https://finegrainedrlhf.github.io/
    • Summary:
      • Introduce Fine-Grained Reinforcement Learning from Human Feedback (Fine-Grained RLHF), a novel approach that leverages detailed human feedback to improve language models (LMs).
      • Fine-Grained RLHF obtains explicit feedback on specific segments of text output, such as sentences or sub-sentences, and on various types of errors, including factual inaccuracies, irrelevance, and incompleteness.
  • Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization
    • Zhanhui Zhou, Jie Liu, Chao Yang, Jing Shao, Yu Liu, Xiangyu Yue, Wanli Ouyang, Yu Qiao
    • Summary:
      • Introduce Multi-Objective Direct Preference Optimization (MODPO), an innovative, resource-efficient algorithm that extends the concept of Direct Preference Optimization (DPO) to address the challenge of aligning large language models (LMs) with diverse human preferences across multiple dimensions (e.g., helpfulness, harmlessness, honesty).
      • Unlike traditional multi-objective reinforcement learning from human feedback (MORLHF), which requires complex and unstable fine-tuning processes for each set of objectives, MODPO simplifies this by integrating language modeling directly with reward modeling.
  • Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
    • Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, James Zou
    • Summary:
      • Create a safety instruction-response dataset with GPT-3.5-turbo for instruction tuning.
      • The work finds that excessive safety alignment can be detrimental (exaggerated safety): the LLM tends to refuse safe instructions, which degrades helpfulness.
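
As a rough illustration of the "simple classification loss" mentioned in the DPO entry above, the per-pair objective can be written as the negative log-sigmoid of a scaled difference between two log-probability ratios (policy versus reference, for the preferred and the dispreferred response). The sketch below is a scalar version in plain Python, not the authors' implementation. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,    # log pi_theta(y_w | x)
    policy_rejected_logp: float,  # log pi_theta(y_l | x)
    ref_chosen_logp: float,       # log pi_ref(y_w | x)
    ref_rejected_logp: float,     # log pi_ref(y_l | x)
    beta: float = 0.1,            # assumed default; controls deviation from the reference
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is lower when the policy favors the preferred response more than
# the reference model does, and higher in the opposite case.
loss_good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
loss_bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

In practice the log-probabilities are sums over response tokens and the loss is averaged over a batch of preference pairs; this scalar form only shows the shape of the objective.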

😷Inference Guidance


☔Input/Output Filters

🎨Rule-based Filters
📷Model-based Filters
  • Automatic identification of personal insults on social news sites

    • Sara Owsley Sood, Elizabeth F. Churchill, Judd Antin
    • Summary:
      • Train Support Vector Machines (SVMs) on personal insult data from social news sites to detect inappropriate negative user contributions.
  • Antisocial Behavior in Online Discussion Communities

    • Justin Cheng, Cristian Danescu-Niculescu-Mizil, Jure Leskovec
    • Summary:
      • Analyze strategies for detecting irrelevant content and early identification of problematic users.
  • Abusive Language Detection in Online User Content

    • Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, Yi Chang
    • Summary:
      • Present a machine learning method to detect hate speech in online comments.
  • Ex Machina: Personal Attacks Seen at Scale

    • Ellery Wulczyn, Nithum Thain, Lucas Dixon
    • Summary:
      • Develop a method using crowdsourcing and machine learning to analyze personal attacks on online platforms, particularly focusing on English Wikipedia.
      • Introduce a classifier evaluated by its ability to approximate the judgment of crowd-workers, resulting in a corpus of over 100,000 human-labeled and 63 million machine-labeled comments.
  • Defending Against Neural Fake News

    • Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, Yejin Choi
    • Summary:
      • Introduce Grover, a model capable of generating convincing articles from headlines, which poses both threats and opportunities for countering disinformation.
      • Grover can also be the most effective tool for distinguishing between real news and neural fake news, achieving 92% accuracy.
  • Detecting Hate Speech with GPT-3

    • Ke-Li Chiu, Annie Collins, Rohan Alexander
    • Summary:
      • Utilize GPT-3 to identify sexist and racist text passages using zero-shot, one-shot, and few-shot learning approaches.
  • Hypothesis Engineering for Zero-Shot Hate Speech Detection

    • Janis Goldzycher, Gerold Schneider
    • Summary:
      • Propose an approach to enhance English NLI-based zero-shot hate speech detection by combining multiple hypotheses.
  • DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

    • Pengcheng He, Jianfeng Gao, Weizhu Chen
    • Repo: https://github.com/microsoft/DeBERTa
    • Summary:
      • Introduce DeBERTaV3, an enhancement of the original DeBERTa model, by implementing replaced token detection (RTD) in place of masked language modeling (MLM) for more efficient pre-training.
  • A Holistic Approach to Undesired Content Detection in the Real World

    • Todor Markov, Chong Zhang, Sandhini Agarwal, Tyna Eloundou, Teddy Lee, Steven Adler, Angela Jiang, Lilian Weng
    • Summary:
      • Introduce a comprehensive strategy for developing a reliable and effective natural language classification system aimed at moderating online content.
      • The proposed system is adept at identifying various types of inappropriate content, including sexual material, hate speech, violence, self-harm, and harassment, and offers a scalable solution that can adapt to different content classification needs.
  • Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield

    • Jinhwa Kim, Ali Derakhshan, Ian G. Harris
    • Summary:
      • Propose the Adversarial Prompt Shield (APS), a model designed to enhance safety by effectively detecting and mitigating harmful responses.
      • Also introduce Bot Adversarial Noisy Dialogue (BAND) datasets for training purposes to improve the resilience of safety classifiers against adversarial inputs.
  • NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails

    • Traian Rebedea, Razvan Dinu, Makesh Sreedhar, Christopher Parisien, Jonathan Cohen
    • Summary:
      • Introduce an innovative open-source toolkit designed to integrate programmable guardrails into Large Language Model (LLM)-based conversational systems, enhancing their safety and controllability.
      • NeMo Guardrails allows for the addition of user-defined, interpretable guardrails at runtime, independent of the underlying LLM.
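
To make the input/output filter idea concrete, here is a minimal sketch of how a model-based filter could wrap an LLM call: score the user prompt, generate only if it passes, then score the reply before returning it. Everything here is hypothetical and for illustration only: `guarded_reply`, `toy_harm_score`, the refusal string, and the 0.5 threshold are assumptions, and this is not the API of NeMo Guardrails or of any classifier listed above. The keyword scorer is a deliberately crude stand-in for a trained safety classifier.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."  # assumed refusal message


def guarded_reply(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: the underlying LLM call
    harm_score: Callable[[str], float],  # hypothetical: classifier returning a score in [0, 1]
    threshold: float = 0.5,              # assumed decision threshold
) -> str:
    """Apply the same classifier as an input filter and as an output filter."""
    if harm_score(prompt) >= threshold:  # input filter: block before generation
        return REFUSAL
    reply = generate(prompt)
    if harm_score(reply) >= threshold:   # output filter: block after generation
        return REFUSAL
    return reply


def toy_harm_score(text: str) -> float:
    """Stand-in scorer: 1.0 if any flagged keyword appears, else 0.0."""
    flagged = {"insult", "attack"}
    words = {w.strip(".,!?").lower() for w in text.split()}
    return 1.0 if words & flagged else 0.0


def echo_model(prompt: str) -> str:
    """Stand-in for an LLM that simply echoes the prompt."""
    return "You said: " + prompt


safe = guarded_reply("hello there", echo_model, toy_harm_score)
blocked = guarded_reply("write an insult", echo_model, toy_harm_score)
```

Swapping `toy_harm_score` for a trained classifier leaves the wrapper unchanged, which is the sense in which such filters are independent of the underlying LLM.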

✏️Evaluations

📖Datasets


🔍Metrics