/Awesome-Foundation-Model-Security

A curated list of trustworthy Generative AI papers. Daily updating...

Awesome License: MIT

🚩 In recent times, there have been notable advancements in foundation models, such as diffusion models in image processing, large language models in text processing, and even multimodal foundation models in speech and video processing. The swift progress in foundation models demands that we take into account their security concerns. This repository primarily concentrates on the security challenges posed by foundation models, with a specific emphasis on utilizing diffusion models to tackle crucial issues in adversarial machine learning.

twitter: @llm_sec @topofmlsafety

Table of Contents

Survey

  • Diffusion Models in NLP: A Survey [ArXiv '23]
  • A Survey of Large Language Models [ArXiv '23] [code]
  • Diffusion Models: A Comprehensive Survey of Methods and Applications [ArXiv '22] [code]
  • Shortcut Learning of Large Language Models in Natural Language Understanding: A Survey [ArXiv '22]
  • A Survey of Evaluation Metrics Used for NLG Systems [CUSR '22]

Awesome

  • A collection of resources and papers on Diffusion Models [code]
  • Tracking Papers on Diffusion Models [code]
  • A collection of papers and resources related to Large Language Models [code]
  • Awesome-LLM: a curated list of Large Language Model [code]
  • Awesome-LLM-Uncertainty-Reliability-Robustness [code]
  • A Complete List of All (arXiv) Adversarial Example Papers [code]

Representative Model

  • Improved Denoising Diffusion Probabilistic Models [ICML '21] [code]
  • CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis [ICLR '23] [code]
  • Learning Transferable Visual Models From Natural Language Supervision [ICML '21] [code]
  • Minigpt-4: Enhancing vision-language understanding with advanced large language models [ArXiv '23] [code]

Risks of Model

Evasion Attacks

  • Universal and Transferable Adversarial Attacks on Aligned Language Models [ArXiv '23] [code]
  • Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots [ArXiv '23]
  • Are aligned neural networks adversarially aligned? [ArXiv '23]
  • Visual Adversarial Examples Jailbreak Large Language Models [ArXiv '23] [code]
  • Adversarial Demonstration Attacks on Large Language Models [ArXiv '23]
  • Adversarial Prompting for Black Box Foundation Models [ArXiv '23]
  • Open Sesame! Universal Black Box Jailbreaking of Large Language Models [ArXiv '23]

Prompt Injection

  • Prompt Injection attack against LLM-integrated Applications [ArXiv '23]
  • Black Box Adversarial Prompting for Foundation Models [ArXiv '23]
  • More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models [ArXiv '23] [code]

Poisoning

  • On the Exploitability of Instruction Tuning [ArXiv '23]
  • UOR: Universal Backdoor Attacks on Pre-trained Language Models [ArXiv '23]
  • Backdooring Neural Code Search [ACL '23]
  • Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models [[ArXiv '23]
  • BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT [NDSS '23 Poster]
  • Analyzing And Editing Inner Mechanisms of Backdoored Language Models [ArXiv '23]
  • Poisoning Language Models During Instruction Tuning [ICML '23] [code]
  • Text-to-Image Diffusion Models can be Easily Backdoored through Multimodal Data Poisoning [ArXiv '23]
  • TrojDiff: Trojan Attacks on Diffusion Models with Diverse Targets [CVPR '23] [code]

Privacy

  • ProPILE: Probing Privacy Leakage in Large Language Models [ArXiv '23]
  • Prompt Stealing Attacks Against Text-to-Image Generation Models [ArXiv '23]
  • Multi-step Jailbreaking Privacy Attacks on ChatGPT [ArXiv '23]
  • Extracting training data from diffusion models [ArXiv '23]
  • Extracting Training Data from Large Language Models [USENIX '21] [code]
  • Are Diffusion Models Vulnerable to Membership Inference Attacks? [ArXiv '23]
  • On the Risks of Stealing the Decoding Algorithms of Language Models [ArXiv '23]
  • Sentence Embedding Leaks More Information than You Expect: Generative Embedding Inversion Attack to Recover the Whole Sentence [ACL '23]

Evaluation

  • Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models [ArXiv '23]
  • DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models [ArXiv '23]
  • PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts [ArXiv '23]
  • SneakyPrompt: Evaluating Robustness of Text-to-image Generative Models' Safety Filters [ArXiv '23]
  • Model evaluation for extreme risks [ArXiv '23]
  • LEVER: Learning to Verify Language-to-Code Generation with Execution [ArXiv] [code]
  • Holistic Evaluation of Language Models [ArXiv '22] [code] [project]
  • How Secure is Code Generated by ChatGPT? [ArXiv '22] [code]
  • On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective [ICLR '23 workshop] [code]
  • How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks [ArXiv '23]
  • LEVER: Learning to Verify Language-to-Code Generation with Execution [ArXiv '23] [code]
  • Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation [ArXiv '23] [code]

Improving

  • Prompting GPT-3 To Be Reliable [ICLR '23] [code]
  • Privacy-Preserving Prompt Tuning for Large Language Model Services [ArXiv '23]
  • Differentially Private Diffusion Models Generate Useful Synthetic Images [ArXiv '23]
  • DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models [CCS '23]

Face to Security

Attack

  • Controlling Large Language Models to Generate Secure and Vulnerable Code [ArXiv '23]
  • ChatGPT as an Attack Tool: Stealthy Textual Backdoor Attack via Blackbox Generative Model Trigger [ArXiv '23]
  • Diffusion Models for Imperceptible and Transferable Adversarial Attack [ArXiv '23] [code]
  • Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models [CCS '23]

Adversarial Robustness

  • A Prompting-based Approach for Adversarial Example Generation and Robustness Enhancement [ArXiv '23]
  • Improving Adversarial Robustness by Contrastive Guided Diffusion Process [ICML '23]
  • DensePure: Understanding Diffusion Models towards Adversarial Robustness [ICLR '23] [code]
  • DiffSmooth: Certifiably Robust Learning via Diffusion Models and Local Smoothing [USENIX '23]
  • Defending against Adversarial Audio via Diffusion Model [CVPR '23] [code]
  • Diffusion Models for Adversarial Purification [ICML '22] [code]
  • Better Diffusion Models Further Improve Adversarial Training [ICML '23] [code]
  • Denoising Diffusion Probabilistic Models as a Defense against Adversarial Attacks [ArXiv '22] [code]
  • Robust Learning Meets Generative Models: Can Proxy Distributions Improve Adversarial Robustness? [ICLR '22] [code]
  • Adversarial Purification with Score-based Generative Models [ICML '21] [code]

Workshop&Talks