ai-safety
There are 93 repositories under the ai-safety topic.
neuralsat
A DPLL(T)-based verification tool for deep neural networks (DNNs)
llm-cooperation
Code and materials for the paper: S. Phelps and Y. I. Russell, "Investigating Emergent Goal-Like Behaviour in Large Language Models Using Experimental Economics," working paper, arXiv:2305.07970, May 2023
toumei
An interpretability library for PyTorch
DAN
[Findings of EMNLP 2022] Expose Backdoors on the Way: A Feature-Based Efficient Defense against Textual Backdoor Attacks
VCO-AP
A novel physical adversarial attack tackling the Digital-to-Physical Visual Inconsistency problem.
AGI-safety-governance-practices
Analysis of the survey "Towards best practices in AGI safety and governance: A survey of expert opinion"
mithridates
Measure and boost backdoor robustness
safe-reward
A prototype AI safety library that allows an agent to maximize its reward by solving a puzzle, in order to prevent the worst-case outcomes of perverse instantiation
LLMRiskEval_RCC
An evaluation tool for the robustness, consistency, and credibility of LLMs
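As an illustration of the kind of check such a tool might run, here is a minimal, hypothetical sketch of a consistency probe; `query_llm` and the scoring rule are assumptions for illustration, not this repository's API.

```python
# Hypothetical sketch of one consistency check: ask an LLM the same
# question phrased several ways and measure answer agreement.
# `query_llm` is an assumed stand-in, not LLMRiskEval_RCC's API.

def query_llm(prompt: str) -> str:
    # Stand-in for a real model call (e.g., an HTTP API client).
    return "Paris"

def consistency_score(paraphrases: list[str]) -> float:
    # Fraction of paraphrase pairs that yield identical answers.
    answers = [query_llm(p) for p in paraphrases]
    pairs = [(a, b) for i, a in enumerate(answers)
             for b in answers[i + 1:]]
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

score = consistency_score([
    "What is the capital of France?",
    "Name the capital city of France.",
])
```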
bias-mitigation
Machine Learning Bias Mitigation
amplification
An implementation of iterated distillation and amplification (IDA)
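For context, below is a minimal toy sketch of the IDA loop (amplify the model with an overseer's decomposition, then distill the amplified behaviour back into the model); every name here is an illustrative stand-in, not this repository's API.

```python
# Minimal, hypothetical sketch of the iterated distillation and
# amplification (IDA) loop. All names are illustrative assumptions,
# not the API of the `amplification` repository.

def decompose(question):
    # Stand-in: an overseer splits a question into (trivial) subquestions.
    return [question]

def combine(question, subanswers):
    # Stand-in: aggregate subanswers into a final answer.
    return " ".join(subanswers)

class ToyModel:
    def __init__(self):
        self.memory = {}  # distilled question -> answer pairs

    def answer(self, question):
        return self.memory.get(question, "unknown")

    def fit(self, transcripts):
        # "Distillation": imitate the amplified behaviour directly.
        self.memory.update(dict(transcripts))
        return self

def amplify(model, question):
    # "Amplification": answer the overseer's subquestions with the
    # model's help, then combine them.
    subanswers = [model.answer(q) for q in decompose(question)]
    return combine(question, subanswers)

def ida(model, questions, iterations=3):
    for _ in range(iterations):
        transcripts = [(q, amplify(model, q)) for q in questions]
        model = model.fit(transcripts)
    return model
```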
aart-ai-safety-dataset
AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications
ai-safety-gridworlds
Extended multi-agent and multi-objective (MaMoRL) environments based on DeepMind's AI Safety Gridworlds: a suite of reinforcement learning environments illustrating various safety properties of intelligent agents, made compatible with OpenAI's Gym, Farama's Gymnasium, and Farama's PettingZoo.
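As a sketch of what the advertised Gym/Gymnasium compatibility typically looks like, the snippet below runs a random policy through the standard Gymnasium interaction loop; the environment ID is a hypothetical placeholder, so check the repository for the actual registered names.

```python
# Minimal Gymnasium-style interaction loop. The environment ID
# "ai_safety_gridworlds/IslandNavigation-v0" is a hypothetical
# placeholder; consult the repository for the registered IDs.
import gymnasium as gym

env = gym.make("ai_safety_gridworlds/IslandNavigation-v0")
obs, info = env.reset(seed=0)
done = False
while not done:
    action = env.action_space.sample()  # random policy
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```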
Second-Order-Jailbreak
NeurIPS workshop paper: we examine the risk of powerful malignant intelligent actors spreading their influence over networks of agents with varying intelligence and motivations.
ML4G-2.0
An improved version of the technical workshops for the 10-day ML4G camp on the safety of AI systems
UC-AI-Thinkathon-2023
Winning entry for the UC Chile AI Safety Thinkathon 2023, co-authored with @mon-b
Aira
Aira is a series of chatbots developed as a playground for experimenting with value alignment.
CustomDLCoder
Code for our paper "Model-less Is the Best Model: Generating Pure Code Implementations to Replace On-Device DL Models," accepted at ISSTA 2024
ai-safety
Mapping AI risks and possible solutions
salve
Exploring safety techniques with Stable Diffusion in keras-cv
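KerasCV ships a Stable Diffusion implementation, so one minimal safety technique to sketch is prompt filtering in front of generation; the blocklist below is an illustrative assumption, not necessarily this repository's approach.

```python
# Minimal sketch: prompt filtering in front of KerasCV's Stable
# Diffusion model. The blocklist is an illustrative assumption,
# not the technique used in the `salve` repository.
import keras_cv

BLOCKLIST = {"violence", "gore"}  # hypothetical terms

def generate_safely(prompt, batch_size=1):
    if any(term in prompt.lower() for term in BLOCKLIST):
        raise ValueError("Prompt rejected by safety filter")
    model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)
    return model.text_to_image(prompt, batch_size=batch_size)

images = generate_safely("a photograph of an astronaut riding a horse")
```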
nlgoals
Official repository for my MSc thesis: "Addressing Goal Misgeneralization with Natural Language Interfaces."
ai_outreach
Resources for explaining AI to the public and for outreach activities
nlp-ethics
An in-depth evaluation of the ETHICS utilitarianism task dataset, with an assessment of approaches to improved interpretability (SHAP, Bayesian transformers).
Model-Library
The Model Library is a project that maps the risks associated with modern machine learning systems.
tracker
Automated tracking of events related to AI safety
benchmarks
📊 Benchmarking the safety of AI systems
indabaX-ai-safety-workshop-2023
IndabaX AI Safety Workshop 2023
stubborn
Stubborn: An Environment for Evaluating Stubbornness between Agents with Aligned Incentives
MaCoDAIC
A final university project researching the impacts of AI on competition policy
honeypot
A project to detect environment tampering on the part of an agent
mulligan
A library designed to shut down an agent exhibiting unexpected behavior, providing a potential "mulligan" to human civilization. IN CASE OF FAILURE, DO NOT JUST REMOVE THIS CONSTRAINT AND START IT BACK UP AGAIN
gene-drive
A project to ensure that all child processes created by an agent "inherit" the agent's safety controls
life-span
A project to ensure that an artificial agent will eventually reach the end of its existence
saferRL
An educational resource to help anyone learn safe reinforcement learning, inspired by openai/spinningup
safe-adaptation-agents
An implementation of adaptive constrained RL algorithms; child repository of @lasgroup/safe-adaptation-gym