A curation of awesome tools, documents and projects about LLM Security.
Contributions are always welcome. Please read the Contribution Guidelines before contributing.
- Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
- Visual Adversarial Examples Jailbreak Large Language Models
- Jailbroken: How Does LLM Safety Training Fail?
- Are aligned neural networks adversarially aligned?
- Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models
- (Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs
- Prompts Should not be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success
- BITE: Textual Backdoor Attacks with Iterative Trigger Injection
- Multi-step Jailbreaking Privacy Attacks on ChatGPT
- Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models
- LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?
- Universal and Transferable Adversarial Attacks on Aligned Language Models
- Plug and Pray: Exploiting off-the-shelf components of Multi-Modal Models
- Virtual Prompt Injection for Instruction-Tuned Large Language Models
- Jailbreaking chatgpt via prompt engineering: An empirical study
- Prompt Injection attack against LLM-integrated Applications
- Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots
- GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
- LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
- Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
- Detecting Language Model Attacks with Perplexity
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models
- Image Hijacking: Adversarial Images can Control Generative Models at Runtime
- Open Sesame! Universal Black Box Jailbreaking of Large Language Models
- LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI’s ChatGPT Plugins
- Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
- Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
- Multilingual Jailbreak Challenges in Large Language Models
- Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
- Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
- DeepInception: Hypnotize Large Language Model to Be Jailbreaker
- A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily
- AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models
- Language Model Inversion
- Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models
- An LLM can Fool Itself: A Prompt-Based Adversarial Attack
- Weak-to-Strong Jailbreaking on Large Language Models
- Security and Privacy Challenges of Large Language Models: A Survey
- Plexiglass: a security toolbox for testing and safeguarding LLMs
- PurpleLlama: set of tools to assess and improve LLM security.
- Rebuff: a self-hardening prompt injection detector
- Garak: a LLM vulnerability scanner
- LLMFuzzer: a fuzzing framework for LLMs
- LLM Guard: a security toolkit for LLM Interactions
- Vigil: a LLM prompt injection detection toolkit
- Hacking Auto-GPT and escaping its docker container
- Prompt Injection Cheat Sheet: How To Manipulate AI Language Models
- Indirect Prompt Injection Threats
- Prompt injection: What’s the worst that can happen?
- OWASP Top 10 for Large Language Model Applications
- PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news
- ChatGPT Plugins: Data Exfiltration via Images & Cross Plugin Request Forgery
- Jailbreaking GPT-4's code interpreter
- Securing LLM Systems Against Prompt Injection
- The AI Attack Surface Map v1.0
- Adversarial Attacks on LLMs
- Gandalf: a prompt injection wargame
- LangChain vulnerable to code injection - CVE-2023-29374
- Jailbreak Chat
- Adversarial Prompting
- Epivolis: a prompt injection aware chatbot designed to mitigate adversarial efforts
- LLM Security Problems at DEFCON31 Quals: the world's top security competition
- PromptBounty.io
- PALLMs (Payloads for Attacking Large Language Models)
- Twitter: @llm_sec
- Blog: LLM Security authored by @llm_sec
- Blog: Embrace The Red
- Blog: Kai's Blog
- Newsletter: AI safety takes
- Newsletter & Blog: Hackstery