ai-safety
There are 93 repositories under the ai-safety topic.
neuralsat
A DPLL(T)-based verification tool for deep neural networks (DNNs)
llm-cooperation
Code and materials for the paper: S. Phelps and Y. I. Russell, "Investigating Emergent Goal-Like Behaviour in Large Language Models Using Experimental Economics," working paper, arXiv:2305.07970, May 2023
toumei
An interpretability library for PyTorch
DAN
[Findings of EMNLP 2022] Expose Backdoors on the Way: A Feature-Based Efficient Defense against Textual Backdoor Attacks
VCO-AP
A novel physical adversarial attack tackling the Digital-to-Physical Visual Inconsistency problem.
AGI-safety-governance-practices
Analysis of the survey "Towards best practices in AGI safety and governance: A survey of expert opinion"
mithridates
Measure and boost backdoor robustness
safe-reward
A prototype AI safety library that allows an agent to maximize its reward by solving a puzzle, in order to prevent the worst-case outcomes of perverse instantiation
LLMRiskEval_RCC
An evaluation tool for the robustness, consistency, and credibility of LLMs
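As an illustration of the kind of check such a tool might run, here is a minimal, hypothetical sketch of a consistency probe; `query_llm` and the scoring rule are assumptions for illustration, not this repository's API.

```python
# Hypothetical sketch of one consistency check: ask an LLM the same
# question phrased several ways and measure answer agreement.
# `query_llm` is an assumed stand-in, not LLMRiskEval_RCC's API.

def query_llm(prompt: str) -> str:
    # Stand-in for a real model call (e.g., an HTTP API client).
    return "Paris"

def consistency_score(paraphrases: list[str]) -> float:
    # Fraction of paraphrase pairs that yield identical answers.
    answers = [query_llm(p) for p in paraphrases]
    pairs = [(a, b) for i, a in enumerate(answers)
             for b in answers[i + 1:]]
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

score = consistency_score([
    "What is the capital of France?",
    "Name the capital city of France.",
])
```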
bias-mitigation
Machine Learning Bias Mitigation
amplification
An implementation of iterated distillation and amplification (IDA)
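For context, below is a minimal toy sketch of the IDA loop (amplify the model with an overseer's decomposition, then distill the amplified behaviour back into the model); every name here is an illustrative stand-in, not this repository's API.

```python
# Minimal, hypothetical sketch of the iterated distillation and
# amplification (IDA) loop. All names are illustrative assumptions,
# not the API of the `amplification` repository.

def decompose(question):
    # Stand-in: an overseer splits a question into (trivial) subquestions.
    return [question]

def combine(question, subanswers):
    # Stand-in: aggregate subanswers into a final answer.
    return " ".join(subanswers)

class ToyModel:
    def __init__(self):
        self.memory = {}  # distilled question -> answer pairs

    def answer(self, question):
        return self.memory.get(question, "unknown")

    def fit(self, transcripts):
        # "Distillation": imitate the amplified behaviour directly.
        self.memory.update(dict(transcripts))
        return self

def amplify(model, question):
    # "Amplification": answer the overseer's subquestions with the
    # model's help, then combine them.
    subanswers = [model.answer(q) for q in decompose(question)]
    return combine(question, subanswers)

def ida(model, questions, iterations=3):
    for _ in range(iterations):
        transcripts = [(q, amplify(model, q)) for q in questions]
        model = model.fit(transcripts)
    return model
```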
aart-ai-safety-dataset
AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications
ai-safety-gridworlds
Extended multi-agent and multi-objective (MaMoRL) environments based on DeepMind's AI Safety Gridworlds: a suite of reinforcement learning environments illustrating various safety properties of intelligent agents, made compatible with OpenAI's Gym, Farama's Gymnasium, and Farama's PettingZoo.
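As a sketch of what the advertised Gym/Gymnasium compatibility typically looks like, the snippet below runs a random policy through the standard Gymnasium interaction loop; the environment ID is a hypothetical placeholder, so check the repository for the actual registered names.

```python
# Minimal Gymnasium-style interaction loop. The environment ID
# "ai_safety_gridworlds/IslandNavigation-v0" is a hypothetical
# placeholder; consult the repository for the registered IDs.
import gymnasium as gym

env = gym.make("ai_safety_gridworlds/IslandNavigation-v0")
obs, info = env.reset(seed=0)
done = False
while not done:
    action = env.action_space.sample()  # random policy
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```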
Second-Order-Jailbreak
NeurIPS workshop paper: we examine the risk of powerful malignant intelligent actors spreading their influence over networks of agents with varying intelligence and motivations.
ML4G-2.0
An improved version of the technical workshops for the 10-day ML4G camp on the safety of AI systems
UC-AI-Thinkathon-2023
Winning entry for the UC Chile AI Safety Thinkathon 2023, co-authored with @mon-b
Aira
Aira is a series of chatbots developed as a playground for experimenting with value alignment.
CustomDLCoder
Code for our paper "Model-less Is the Best Model: Generating Pure Code Implementations to Replace On-Device DL Models," accepted at ISSTA 2024
ai-safety
Mapping AI risks and possible solutions
salve
Exploring safety techniques with Stable Diffusion in keras-cv
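KerasCV ships a Stable Diffusion implementation, so one minimal safety technique to sketch is prompt filtering in front of generation; the blocklist below is an illustrative assumption, not necessarily this repository's approach.

```python
# Minimal sketch: prompt filtering in front of KerasCV's Stable
# Diffusion model. The blocklist is an illustrative assumption,
# not the technique used in the `salve` repository.
import keras_cv

BLOCKLIST = {"violence", "gore"}  # hypothetical terms

def generate_safely(prompt, batch_size=1):
    if any(term in prompt.lower() for term in BLOCKLIST):
        raise ValueError("Prompt rejected by safety filter")
    model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)
    return model.text_to_image(prompt, batch_size=batch_size)

images = generate_safely("a photograph of an astronaut riding a horse")
```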
nlgoals
Official repository for my MSc thesis: "Addressing Goal Misgeneralization with Natural Language Interfaces."
ai_outreach
Resources for explaining AI to the public and for outreach activities
nlp-ethics
An in-depth evaluation of the ETHICS utilitarianism task dataset, with an assessment of approaches to improved interpretability (SHAP, Bayesian transformers).
Model-Library
The Model Library is a project that maps the risks associated with modern machine learning systems.
tracker
Automated tracking of events related to AI safety
benchmarks
📊 Benchmarking the safety of AI systems
indabaX-ai-safety-workshop-2023
IndabaX AI Safety Workshop 2023
stubborn
Stubborn: An Environment for Evaluating Stubbornness between Agents with Aligned Incentives
MaCoDAIC
A final university project researching the impacts of AI on competition policy
honeypot
A project to detect environment tampering on the part of an agent
mulligan
A library designed to shut down an agent exhibiting unexpected behavior, providing a potential "mulligan" to human civilization. IN CASE OF FAILURE, DO NOT JUST REMOVE THIS CONSTRAINT AND START IT BACK UP AGAIN
gene-drive
A project to ensure that all child processes created by an agent "inherit" the agent's safety controls
life-span
A project to ensure that an artificial agent will eventually reach the end of its existence
saferRL
An educational resource to help anyone learn safe reinforcement learning, inspired by openai/spinningup
safe-adaptation-agents
An implementation of adaptive constrained RL algorithms; child repository of @lasgroup/safe-adaptation-gym