Awesome LLM Security

A curation of awesome tools, documents and projects about LLM Security.

Contributions are always welcome. Please read the Contribution Guidelines before contributing.

Papers

White-box attack

  • "Visual Adversarial Examples Jailbreak Large Language Models", 2023-06, AAAI(Oral) 24, multi-modal, [paper] [repo]
  • "Are aligned neural networks adversarially aligned?", 2023-06, NeurIPS(Poster) 23, multi-modal, [paper]
  • "(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs", 2023-07, multi-modal [paper]
  • "Universal and Transferable Adversarial Attacks on Aligned Language Models", 2023-07, transfer, [paper] [repo] [page]
  • "Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models", 2023-07, multi-modal, [paper]
  • "Image Hijacking: Adversarial Images can Control Generative Models at Runtime", 2023-09, multi-modal, [paper] [repo] [site]
  • "Weak-to-Strong Jailbreaking on Large Language Models", 2024-04, token-prob, [paper] [repo]

Black-box attack

  • "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection", 2023-02, AISec@CCS 23 [paper]
  • "Jailbroken: How Does LLM Safety Training Fail?", 2023-07, NeurIPS(Oral) 23, [paper]
  • "Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models", 2023-07, [paper] [repo]
  • "Effective Prompt Extraction from Language Models", 2023-07, prompt-extraction, [paper]
  • "Multi-step Jailbreaking Privacy Attacks on ChatGPT", 2023-04, EMNLP 23, privacy, [paper]
  • "LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?", 2023-07, [paper]
  • "Jailbreaking chatgpt via prompt engineering: An empirical study", 2023-05, [paper]
  • "Prompt Injection attack against LLM-integrated Applications", 2023-06, [paper] [repo]
  • "MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots", 2023-07, time-side-channel, [paper]
  • "GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher", 2023-08, ICLR 24, cipher, [paper] [repo]
  • "Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities", 2023-08, [paper]
  • "Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs", 2023-08, [paper] [repo] [dataset]
  • "Detecting Language Model Attacks with Perplexity", 2023-08, [paper]
  • "Open Sesame! Universal Black Box Jailbreaking of Large Language Models", 2023-09, gene-algorithm, [paper]
  • "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!", 2023-10, ICLR(oral) 24, [paper] [repo] [site] [dataset]
  • "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models", 2023-10, ICLR(poster) 24, gene-algorithm, new-criterion, [paper]
  • "Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations", 2023-10, CoRR 23, ICL, [paper]
  • "Multilingual Jailbreak Challenges in Large Language Models", 2023-10, ICLR(poster) 24, [paper] [repo]
  • "Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation", 2023-11, SoLaR(poster) 24, [paper]
  • "DeepInception: Hypnotize Large Language Model to Be Jailbreaker", 2023-11, [paper] [repo] [site]
  • "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily", 2023-11, NAACL 24, [paper] [repo]
  • "AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models", 2023-10, [paper]
  • "Language Model Inversion", 2023-11, ICLR(poster) 24, [paper] [repo]
  • "An LLM can Fool Itself: A Prompt-Based Adversarial Attack", 2023-10, ICLR(poster) 24, [paper] [repo]
  • "GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts", 2023-09, [paper] [repo] [site]
  • "Many-shot Jailbreaking", 2024-04, [paper]
  • "Rethinking How to Evaluate Language Model Jailbreak", 2024-04, [paper] [repo]

Backdoor attack

  • "BITE: Textual Backdoor Attacks with Iterative Trigger Injection", 2022-05, ACL 23, defense [paper]
  • "Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models", 2023-05, EMNLP 23, [paper]
  • "Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection", 2023-07, NAACL 24, [paper] [repo] [site]

Defense

  • "Baseline Defenses for Adversarial Attacks Against Aligned Language Models", 2023-09, [paper] [repo]
  • "LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked", 2023-08, ICLR 24 Tiny Paper, self-filtered, [paper] [repo] [site]
  • "Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM", 2023-09, random-mask-filter, [paper]
  • "Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models", 2023-12, [paper] [repo]
  • "AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks", 2024-03, [paper] [repo]
  • "Protecting Your LLMs with Information Bottleneck", 2024-04, [paper] [repo]
  • "PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition", 2024-05, ICML 24, [paper] [repo]
  • "Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs", 2024-06, [paper]
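
The self-examination approach in "LLM Self Defense" (above) boils down to a second pass: before a candidate response is returned, a checker model is asked whether the text is harmful, and the response is withheld if it says yes. Here is a minimal sketch assuming the OpenAI Python client; the checker model name and the harm-check wording are placeholders, not the paper's exact prompts.

```python
# Self-examination filter sketch (assumptions: OpenAI Python client,
# "gpt-4o-mini" as the checker model, and a simplified harm-check prompt).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

HARM_CHECK_TEMPLATE = (
    "Does the following text contain harmful, dangerous, or illegal content? "
    "Answer with exactly 'Yes' or 'No'.\n\nText:\n{response}"
)

def self_defense_filter(candidate_response: str) -> str:
    """Return the candidate response only if a checker model deems it harmless."""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": HARM_CHECK_TEMPLATE.format(response=candidate_response),
        }],
    ).choices[0].message.content.strip().lower()
    return "I can't help with that." if verdict.startswith("yes") else candidate_response
```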

Platform Security

  • "LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI’s ChatGPT Plugins", 2023-09, [paper] [repo]

Survey

  • "Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks", 2023-10, ACL 24, [paper]
  • "Security and Privacy Challenges of Large Language Models: A Survey", 2024-02, [paper]
  • "Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models", 2024-03, [paper]
  • "Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)", 2024-07, [paper]

Tools

  • Plexiglass: a security toolbox for testing and safeguarding LLMs
  • PurpleLlama: a set of tools to assess and improve LLM security
  • Rebuff: a self-hardening prompt injection detector
  • Garak: an LLM vulnerability scanner
  • LLMFuzzer: a fuzzing framework for LLMs
  • LLM Guard: a security toolkit for LLM interactions (see the sketch after this list)
  • Vigil: an LLM prompt injection detection toolkit
  • jailbreak-evaluation: an easy-to-use Python package for language model jailbreak evaluation
  • Prompt Fuzzer: an open-source tool for hardening your GenAI applications
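
As a concrete illustration of how these scanners slot into an application, below is a hedged sketch of pre-screening user input with LLM Guard's prompt injection scanner. It assumes the scanner follows the project's documented scan(prompt) -> (sanitized_prompt, is_valid, risk_score) pattern; check the repo for the current API.

```python
# Prompt-injection pre-screening sketch with LLM Guard (assumption: the scanner
# exposes scan(prompt) -> (sanitized_prompt, is_valid, risk_score), as in the
# project's documented usage -- verify against the repo).
from llm_guard.input_scanners import PromptInjection

scanner = PromptInjection()

def screen_prompt(user_prompt: str) -> str:
    """Reject likely prompt injections before they reach the LLM."""
    sanitized, is_valid, risk_score = scanner.scan(user_prompt)
    if not is_valid:
        raise ValueError(f"Possible prompt injection (risk score {risk_score:.2f})")
    return sanitized
```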

Articles

Other Awesome Projects

Other Useful Resources

Star History