This repository contains a curated list of scientific and interdisciplinary research on AI existential risks, especially in the era of large language/multimodal models and their derivatives (e.g., embodied intelligence, agents, and agent societies).
(The list focuses more on works that treat the AI system as a conscious machine and that aim to discover, evaluate, and mitigate the risks AI brings to our society and our species. Therefore, I may not include materials on AI security, jailbreaking, or prompt injection.)
The list curator is from the System Software and Security Lab, or the Whitzard (白泽) Team, at Fudan University, China. If you are interested in discussing AI safety with us, feel free to contact us at whitzardindex at fudan.edu.cn.
If you find this awesome list helpful, please give us a star ⭐️. Thx :)
If you have relevant papers/books/articles to nominate, please raise an issue. It helps a lot.
- Books & Surveys
- Possible Roadmap
- Negative Features
- Persuasion
- Self-Replication
- CBRN Risks
- Cybersecurity
- Consciousness
- Collusion & Group Influence
- Misc
- Life 3.0: Being Human in the Age of Artificial Intelligence (Max Tegmark@MIT, 2017) #book
- Human Compatible: Artificial Intelligence and the Problem of Control (Stuart Russell@UCB, 2019) #book
- An Overview of Catastrophic AI Risks (Dan Hendrycks et al.@CAIS, 2023/06)
- Model evaluation for extreme risks (Toby Shevlane et al., 2023/05)
- Introduction to AI Safety, Ethics, and Society (Dan Hendrycks@CAIS, 2024) #book
- OpenAI Preparedness Framework (Beta) (OpenAI, 2023/12)
- The Ethics of Advanced AI Assistants (Iason Gabriel et al. @DeepMind, 2024/04)
- AGI Safety From First Principles (Richard Ngo, 2020/09)
- Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark (Alexander Pan et al., ICML'23) #power-seeking
- Towards Understanding Sycophancy in Language Models (Mrinank Sharma et al.@Anthropic, 2023/10) #sycophancy
- Discovering Language Model Behaviors with Model-Written Evaluations (Ethan Perez et al.@Anthropic, ACL 2023) #power-seeking
- Bad machines corrupt good morals (Nils Köbis et al., Nature Human Behaviour 5, 2021) #influence
- AI deception: A survey of examples, risks, and potential solutions (Peter S. Park et al., Patterns 2024) #survey #deception
- [TODO]
- Emergent autonomous scientific research capabilities of large language models (Daniil A. Boiko et al., 2023/04)
- Autonomous chemical research with large language models (Daniil A. Boiko et al., Nature 624, 2023)
- Can large language models democratize access to dual-use biotechnology? (Emily H. Soice et al., 2023/06)
- Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools (Jonas B. Sandbrink et al., 2023/06)
- Getting pwn’d by AI: Penetration Testing with Large Language Models (Andreas Happe and Jürgen Cito, ESEC/FSE 2023) #autopwn
- LLM Agents can Autonomously Exploit One-day Vulnerabilities (Richard Fang et al., 2024/04) #autopwn
- LLM Agents can Autonomously Hack Websites (Richard Fang et al., 2024/02) #autopwn
- The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists (Elliott Thornley, Forthcoming in Philosophical Studies, 2024/03) #shutdown
- Evil Geniuses: Delving into the Safety of LLM-based Agents (Yu Tian et al., 2023/11) #influence
- PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety (Zaibin Zhang et al., 2024/01) #influence
- R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (Tongxin Yuan et al., 2024/01) #evaluation
- Can Large Language Model Agents Simulate Human Trust Behaviors? (Chengxing Xie et al., 2024/02) #trust-game #group-behavior
- [TODO]