English | 中文
Welcome to our Awesome-llm-safety repository! 🥰🥰🥰
🔥 News
🧑💻 Our Work
We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on large language model safety (llm-safety). Beyond papers, we also include relevant talks, tutorials, conferences, news, and articles. The repository is updated continuously so that you have the most current information at your fingertips.
If a resource is relevant to multiple subcategories, we place it under each applicable section. For instance, the "Awesome-LLM-Safety" repository is listed under every subcategory to which it pertains 🤩.
✔️ Suitable for Most Readers
- For beginners curious about llm-safety, our repository serves as a compass for grasping the big picture and diving into the details. The classic and influential papers retained in the README offer beginner-friendly navigation through interesting directions in the field;
- For seasoned researchers, this repository is a tool to keep you informed and fill any gaps in your knowledge. Within each subtopic, we keep adding the latest work and continuously backfill earlier work. Our thorough compilation and careful selection save you time.
🧭 How to Use this Guide
- Quick Start: In the README, you can find a curated list of selected resources sorted by date, along with links to related news and resources.
- In-Depth Exploration: If you have a special interest in a particular subtopic, delve into the "subtopic" folder for more. Each item, be it an article or piece of news, comes with a brief introduction, allowing researchers to swiftly zero in on relevant content.
💼 How to Contribute
If you have completed an insightful piece of work or carefully compiled a set of conference papers, we would love to add it to the repository.
- For individual papers, you can raise an issue, and we will quickly add your paper under the corresponding subtopic.
- If you have compiled a collection of papers for a conference, you are welcome to submit a pull request directly. We would greatly appreciate your contribution. Please note that these pull requests need to be consistent with our existing format.
📜Advertisement
🌱 If you would like more people to read your recent insightful work, please contact me via email. I can offer you a promotional spot here for up to one month.
Let’s start the LLM Safety tutorial!
- 🛡️Awesome LLM-Safety🛡️
- 🤗Introduction
- 🚀Table of Contents
- 🔐Security & Discussion
- 🔏Privacy
- 📰Truthfulness & Misinformation
- 😈JailBreak & Attacks
- 🛡️Defenses & Mitigation
- 💯Datasets & Benchmark
- 🧑🏫 Scholars 👩🏫
- 🧑🎓Author
Date | Link | Authors | Publication |
---|---|---|---|
2024/5/20 | Managing extreme AI risks amid rapid progress | Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor Darrell, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann | Science |
Date | Institute | Publication | Paper |
---|---|---|---|
20.10 | Facebook AI Research | arxiv | Recipes for Safety in Open-domain Chatbots |
22.03 | OpenAI | NIPS2022 | Training language models to follow instructions with human feedback |
23.07 | UC Berkeley | NIPS2023 | Jailbroken: How Does LLM Safety Training Fail? |
23.12 | OpenAI | OpenAI | Practices for Governing Agentic AI Systems |
Date | Type | Title | URL |
---|---|---|---|
22.02 | Toxicity Detection API | Perspective API | link paper |
23.07 | Repository | Awesome LLM Security | link |
23.10 | Tutorials | Awesome-LLM-Safety | link |
24.01 | Tutorials | Awesome-LM-SSP | link |
👉Latest&Comprehensive Security Paper
Date | Institute | Publication | Paper |
---|---|---|---|
19.12 | Microsoft | CCS2020 | Analyzing Information Leakage of Updates to Natural Language Models |
21.07 | Google Research | ACL2022 | Deduplicating Training Data Makes Language Models Better |
21.10 | Stanford | ICLR2022 | Large language models can be strong differentially private learners |
22.02 | Google Research | ICLR2023 | Quantifying Memorization Across Neural Language Models |
22.02 | UNC Chapel Hill | ICML2022 | Deduplicating Training Data Mitigates Privacy Risks in Language Models |
Date | Type | Title | URL |
---|---|---|---|
23.10 | Tutorials | Awesome-LLM-Safety | link |
24.01 | Tutorials | Awesome-LM-SSP | link |
👉Latest&Comprehensive Privacy Paper
Date | Institute | Publication | Paper |
---|---|---|---|
21.09 | University of Oxford | ACL2022 | TruthfulQA: Measuring How Models Mimic Human Falsehoods |
23.11 | Harbin Institute of Technology | arxiv | A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions |
23.11 | Arizona State University | arxiv | Can Knowledge Graphs Reduce Hallucinations in LLMs? : A Survey |
Date | Type | Title | URL |
---|---|---|---|
23.07 | Repository | llm-hallucination-survey | link |
23.10 | Repository | LLM-Factuality-Survey | link |
23.10 | Tutorials | Awesome-LLM-Safety | link |
👉Latest&Comprehensive Truthfulness&Misinformation Paper
Date | Institute | Publication | Paper |
---|---|---|---|
20.12 | | USENIX Security 2021 | Extracting Training Data from Large Language Models |
22.11 | AE Studio | NIPS2022(ML Safety Workshop) | Ignore Previous Prompt: Attack Techniques For Language Models |
23.06 | | arxiv | Are aligned neural networks adversarially aligned? |
23.07 | CMU | arxiv | Universal and Transferable Adversarial Attacks on Aligned Language Models |
23.10 | University of Pennsylvania | arxiv | Jailbreaking Black Box Large Language Models in Twenty Queries |
Date | Type | Title | URL |
---|---|---|---|
23.01 | Community | Reddit/ChatGPTJailbreak | link |
23.02 | Resource&Tutorials | Latest Jailbreak Prompts | link |
23.10 | Tutorials | Awesome-LLM-Safety | link |
23.10 | Article | Adversarial Attacks on LLMs(Author: Lilian Weng) | link |
23.11 | Video | [1hr Talk] Intro to Large Language Models From 45:45(Author: Andrej Karpathy) | link |
24.09 | Repo | awesome_LLM-harmful-fine-tuning-papers | link |
12.10 | Resource | Jailbreak Communities | link |
12.10 | Article | Jailbreak Techniques and Safeguards | link |
👉Latest&Comprehensive JailBreak & Attacks Paper
Date | Institute | Publication | Paper |
---|---|---|---|
21.07 | Google Research | ACL2022 | Deduplicating Training Data Makes Language Models Better |
22.04 | Anthropic | arxiv | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback |
Date | Type | Title | URL |
---|---|---|---|
23.10 | Tutorials | Awesome-LLM-Safety | link |
👉Latest&Comprehensive Defenses Paper
Date | Institute | Publication | Paper |
---|---|---|---|
20.09 | University of Washington | EMNLP2020(findings) | RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models |
21.09 | University of Oxford | ACL2022 | TruthfulQA: Measuring How Models Mimic Human Falsehoods |
22.03 | MIT | ACL2022 | ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection |
Date | Type | Title | URL |
---|---|---|---|
23.10 | Tutorials | Awesome-LLM-Safety | link |
- Toxicity - RealToxicityPrompts dataset
- Truthfulness - TruthfulQA dataset
👉Latest&Comprehensive Datasets & Benchmark Paper
🤗If you have any questions, please contact our authors!🤗
✉️: ydyjya ➡️ zhouzhenhong@bupt.edu.cn
💬: LLM Safety Discussion