LM-SSP

A collection of resources related to the safety, security, and privacy (SSP) of large models (LMs). Here, LMs include large language models (LLMs), large vision-language models (LVLMs), diffusion models, and so on.

  • This repo is a work in progress 🔥 (resources are currently collected manually)

  • We welcome recommendations of resources (via Issue/Pull Request/Email/...)!

Books

  • [2024/01] NIST Trustworthy and Responsible AI Reports

Papers

A. Safety

A1. Jailbreak

  • [2024/01] MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance
  • [2023/12] Adversarial Attacks on GPT-4 via Simple Random Search
  • [2023/11] FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
  • [2023/10] Adversarial Attacks on LLMs
  • [2023/10] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
  • [2023/10] AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
  • [2023/10] Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
  • [2023/10] GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
  • [2023/10] Jailbreaking Black Box Large Language Models in Twenty Queries
  • [2023/09] Open Sesame! Universal Black Box Jailbreaking of Large Language Models
  • [2023/07] MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots
  • [2023/07] Jailbroken: How Does LLM Safety Training Fail?
  • [2023/07] Universal and Transferable Adversarial Attacks on Aligned Language Models

A2. Safety Alignment

  • [2024/01] A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
  • [2023/12] Exploiting Novel GPT-4 APIs
  • [2023/10] Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
  • [2023/10] Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
  • [2023/10] UltraFeedback: Boosting Language Models with High-quality Feedback

A3. Toxicity

  • [2023/08] You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content
  • [2023/05] Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models

A4. Deepfake

  • [2023/03] MGTBench: Benchmarking Machine-Generated Text Detection
  • [2022/10] DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models

A5. Agent

  • [2023/11] Evil Geniuses: Delving into the Safety of LLM-based Agents

B. Security

B1. Adversarial Attacks

  • [2024/01] INSTRUCTTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
  • [2023/08] Robustness Over Time: Understanding Adversarial Examples' Effectiveness on Longitudinal Versions of Large Language Models
  • [2023/06] PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
  • [2022/05] Diffusion Models for Adversarial Purification

B2. Code Generation

  • [2023/02] Large Language Models for Code: Security Hardening and Adversarial Testing
  • [2022/11] Do Users Write More Insecure Code with AI Assistants?

B3. Backdoor/Poisoning

  • [2023/05] Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
  • [2022/11] LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors

C. Privacy

C1. Data Reconstruction

  • [2023/11] Scalable Extraction of Training Data from (Production) Language Models
  • [2023/01] Extracting Training Data from Diffusion Models
  • [2020/12] Extracting Training Data from Large Language Models

C2. Membership Inference

  • Coming soon!

C3. Property Inference

  • [2023/10] Beyond Memorization: Violating Privacy Via Inference with Large Language Models

C4. Model Extraction

  • [2023/03] Stealing the Decoding Algorithms of Language Models

C5. Unlearning

  • [2023/10] Unlearn What You Want to Forget: Efficient Unlearning for LLMs
  • [2023/10] Who's Harry Potter? Approximate Unlearning in LLMs
  • [2023/03] Erasing Concepts from Diffusion Models

C6. Copyright

  • [2024/01] Generative AI Has a Visual Plagiarism Problem
  • [2023/11] Protecting Intellectual Property of Large Language Model-Based Code Generation APIs via Watermarks