ai-safety

There are 93 repositories under the ai-safety topic.

  • recursive-other-improvement

    Language: Jupyter Notebook · 7 stars
  • neuralsat

    A DPLL(T)-based verification tool for DNNs

    Language: Python · 10 stars
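
    DPLL(T)-style DNN verification case-splits on the activation phase of each ReLU and propagates bounds for each case. Below is a toy, self-contained sketch of that branching idea, assuming a one-hidden-layer network with box input bounds; it is not NeuralSAT's actual interface.

```python
# Toy DPLL-style case-splitting on ReLU phases (not NeuralSAT's API).
# Verifies y <= threshold for y = W2 @ relu(W1 @ x + b1), x in [lo, hi].
import numpy as np

def affine_bounds(W, b, lo, hi):
    """Propagate a box through an affine layer via interval arithmetic."""
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    mid = W @ center + b
    rad = np.abs(W) @ radius
    return mid - rad, mid + rad

def verify(W1, b1, W2, lo, hi, threshold, phases=()):
    pre_lo, pre_hi = affine_bounds(W1, b1, lo, hi)
    post_lo, post_hi = np.maximum(pre_lo, 0), np.maximum(pre_hi, 0)
    for i, active in enumerate(phases):        # apply phase decisions so far
        if active:                             # branch assumes pre-activation >= 0
            if pre_hi[i] < 0:
                return True                    # infeasible branch: vacuously safe
            post_lo[i], post_hi[i] = max(pre_lo[i], 0.0), pre_hi[i]
        else:                                  # branch assumes pre-activation <= 0
            if pre_lo[i] > 0:
                return True                    # infeasible branch
            post_lo[i] = post_hi[i] = 0.0
    out_lo, out_hi = affine_bounds(W2, np.zeros(W2.shape[0]), post_lo, post_hi)
    if np.all(out_hi <= threshold):
        return True                            # property proven on this branch
    if len(phases) == len(b1):
        return False                           # fully decided, cannot prove it
    # DPLL-style split: the property must hold under both phases of the next ReLU.
    return (verify(W1, b1, W2, lo, hi, threshold, phases + (True,)) and
            verify(W1, b1, W2, lo, hi, threshold, phases + (False,)))

rng = np.random.default_rng(0)
W1, b1, W2 = rng.normal(size=(4, 2)), rng.normal(size=4), rng.normal(size=(1, 4))
print(verify(W1, b1, W2, np.full(2, -0.1), np.full(2, 0.1), threshold=5.0))
```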
  • llm-cooperation

    Code and materials for the paper: S. Phelps and Y. I. Russell, "Investigating Emergent Goal-Like Behaviour in Large Language Models Using Experimental Economics", working paper, arXiv:2305.07970, May 2023

    Language: Python · 10 stars
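
    Experiments in this vein score LLM agents with standard game-theoretic payoffs. A minimal sketch of the payoff bookkeeping for an iterated prisoner's dilemma follows; the payoff values are the textbook choice, assumed here rather than taken from the paper.

```python
# Iterated prisoner's dilemma scoring. The payoff matrix is the standard
# textbook choice, assumed for illustration (not taken from the paper).
PAYOFFS = {  # (my move, their move) -> my payoff; C = cooperate, D = defect
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def score(moves_a, moves_b):
    """Total payoff for each player over a sequence of simultaneous rounds."""
    a = sum(PAYOFFS[(x, y)] for x, y in zip(moves_a, moves_b))
    b = sum(PAYOFFS[(y, x)] for x, y in zip(moves_a, moves_b))
    return a, b

print(score("CCD", "CDD"))  # -> (4, 9): defection pays off against a cooperator
```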
  • toumei

    An interpretability library for PyTorch

    Language: Python · 10 stars
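
    Interpretability libraries of this kind commonly offer feature visualization: optimizing an input until a chosen unit fires strongly. A generic activation-maximization sketch in PyTorch follows; the model, layer, and channel are arbitrary placeholders, not toumei's actual interface.

```python
# Generic activation maximization (placeholder model/layer, not toumei's API):
# optimize an input image to excite one channel of a chosen conv layer.
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()  # stand-in network
activations = {}
model.layer3.register_forward_hook(
    lambda module, inputs, output: activations.update(feat=output))

x = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.05)

for _ in range(256):
    optimizer.zero_grad()
    model(x)
    loss = -activations["feat"][0, 7].mean()  # ascend channel 7's mean activation
    loss.backward()
    optimizer.step()
```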
  • DAN

    [Findings of EMNLP 2022] Expose Backdoors on the Way: A Feature-Based Efficient Defense against Textual Backdoor Attacks

    Language: Python · 9 stars
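
    Feature-based defenses of this family score an input by its distance to the distribution of clean features, flagging far-away points as poisoned. A minimal Mahalanobis-distance sketch with synthetic features follows; it illustrates the general idea, not the paper's exact scoring rule.

```python
# Distance-based anomaly scoring on synthetic features (illustrative only;
# not the exact DAN scoring rule). Fit a Gaussian to clean features, then
# flag inputs whose squared Mahalanobis distance exceeds a clean quantile.
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 16))                    # clean validation features
test = np.vstack([rng.normal(size=(5, 16)),           # clean-looking inputs
                  rng.normal(loc=4.0, size=(5, 16))]) # shifted (suspect) inputs

mu = clean.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(clean, rowvar=False) + 1e-6 * np.eye(16))

def mahalanobis_sq(feats):
    diff = feats - mu
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

threshold = np.quantile(mahalanobis_sq(clean), 0.99)
print(mahalanobis_sq(test) > threshold)  # True marks suspected backdoor inputs
```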
  • VCO-AP

    A novel physical adversarial attack tackling the Digital-to-Physical Visual Inconsistency problem.

    Language: Python · 8 stars
  • AGI-safety-governance-practices

    Analysis of the survey "Towards best practices in AGI safety and governance: A survey of expert opinion"

    Language: Jupyter Notebook · 8 stars
  • mithridates

    Measure and Boost Backdoor Robustness

    Language: Jupyter Notebook · 8 stars
  • safe-reward

    A prototype of an AI safety library that lets an agent maximize its reward only by solving a puzzle, in order to prevent the worst-case outcomes of perverse instantiation

    Language: Python · 8 stars
  • LLMRiskEval_RCC

    An evaluation tool for the robustness, consistency, and credibility of LLMs

    Language: Python · 7 stars
  • bias-mitigation

    Machine Learning Bias Mitigation

    Language: Jupyter Notebook · 7 stars
  • amplification

    An implementation of iterated distillation and amplification

    Language: Python · 7 stars
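
    Iterated distillation and amplification alternates two steps: amplify a model by letting it decompose a task and consult copies of itself, then distill the slower amplified system back into a fast model. A schematic loop with toy stand-ins for every component follows; none of these names come from the repository.

```python
# Schematic IDA loop with toy stand-ins; none of these names come from the repo.
def decompose(question):
    return question.split()               # toy decomposition: one word per subtask

def amplify(model, question):
    """Amplification: answer by consulting the model on each sub-question."""
    return " ".join(model(q) for q in decompose(question))

def distill(model, questions):
    """Distillation: a 'student' that imitates the amplified system's answers."""
    dataset = {q: amplify(model, q) for q in questions}
    return lambda q: dataset[q] if q in dataset else model(q)

def ida(model, questions, rounds=3):
    for _ in range(rounds):
        model = distill(model, questions)  # the student seeds the next round
    return model

base = lambda q: q.upper()                 # toy base model
print(ida(base, ["is this safe"])("is this safe"))  # -> "IS THIS SAFE"
```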
  • aart-ai-safety-dataset

    AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications

  • ai-safety-gridworlds

    Extended, multi-agent and multi-objective (MaMoRL) environments based on DeepMind's AI Safety Gridworlds: a suite of reinforcement learning environments illustrating various safety properties of intelligent agents, made compatible with OpenAI's Gym/Gymnasium and the Farama Foundation's PettingZoo.

    Language: Python · 6 stars
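
    Gymnasium-compatible environments expose the standard reset/step loop, so a safety gridworld can be driven like any other environment. A minimal interaction sketch follows; the environment id is a placeholder, so check the repository for the real registered names.

```python
# Standard Gymnasium interaction loop. The environment id below is a
# placeholder; see the repository for the actual registered names.
import gymnasium as gym

env = gym.make("SafetyGridworld-v0")    # hypothetical id
obs, info = env.reset(seed=0)
terminated = truncated = False
episode_return = 0.0
while not (terminated or truncated):
    action = env.action_space.sample()  # random policy, for illustration only
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
env.close()
print(episode_return)
```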
  • Second-Order-Jailbreak

    NeurIPS workshop: we examine the risk of powerful malignant intelligent actors spreading their influence over networks of agents with varying intelligence and motivations.

    Language: Python · 5 stars
  • ML4G-2.0

    An improved version of the technical workshops for the 10-day ML4G camp on the safety of AI systems

    Language: Jupyter Notebook · 4 stars
  • UC-AI-Thinkathon-2023

    Winning entry for the UC Chile AI Safety Thinkathon 2023. Co-authored with @mon-b.

    Language: R · 4 stars
  • Aira

    Aira is a series of chatbots developed as an experimentation playground for value alignment.

    Language: Jupyter Notebook · 4 stars
  • CustomDLCoder

    Code for our paper "Model-less Is the Best Model: Generating Pure Code Implementations to Replace On-Device DL Models", accepted at ISSTA'24

    Language: Python · 3 stars
  • ai-safety

    Mapping AI risks and possible solutions

    Language: JavaScript · 2 stars
  • salve

    Exploring safety techniques with Stable Diffusion in keras-cv

    Language: Jupyter Notebook · 2 stars
  • nlgoals

    Official repository for my MSc thesis: "Addressing Goal Misgeneralization with Natural Language Interfaces."

    Language: TeX · 2 stars
  • ai_outreach

    Resources for explaining AI to the public, and for outreach activities

  • nlp-ethics

    An in-depth evaluation of the ETHICS utilitarianism task dataset, with an assessment of approaches to improved interpretability (SHAP, Bayesian transformers).

    Language: Jupyter Notebook · 2 stars
  • Model-Library

    The Model Library is a project that maps the risks associated with modern machine learning systems.

    Language: Python · 1 star
  • tracker

    Automated tracking of events related to AI safety

  • benchmarks

    📊 Benchmarking the safety of AI systems

    Language: Jupyter Notebook · 1 star
  • indabaX-ai-safety-workshop-2023

    IndabaX AI Safety Workshop 2023

  • stubborn

    Stubborn: An Environment for Evaluating Stubbornness between Agents with Aligned Incentives

    Language: Python · 1 star
  • MaCoDAIC

    A final university project researching the impacts of AI on competition policy

    Language: C# · 1 star
  • honeypot

    A project to detect environment tampering on the part of an agent

  • mulligan

    A library designed to shut down an agent exhibiting unexpected behavior, providing a potential "mulligan" to human civilization; IN CASE OF FAILURE, DO NOT JUST REMOVE THIS CONSTRAINT AND START IT BACK UP AGAIN

  • gene-drive

    A project to ensure that all child processes created by an agent "inherit" the agent's safety controls

  • life-span

    A project to ensure that an artificial agent will eventually reach the end of its existence

  • saferRL

    An educational resource to help anyone learn safe reinforcement learning, inspired by openai/spinningup

    Language: Python · 1 star
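
    A recurring pattern in safe RL is the Lagrangian method: a multiplier on the cost constraint rises whenever the policy exceeds its cost budget, pushing the policy back toward safety. A minimal numpy sketch of that multiplier dynamic on a synthetic one-parameter "policy" follows; this is not code from the repository.

```python
# Toy Lagrangian-constrained update on a synthetic 1-D "policy" (not repo code):
# higher theta earns more reward but also more cost; keep expected cost <= budget.
import numpy as np

rng = np.random.default_rng(0)
theta, lam, budget = 0.0, 0.0, 1.0

for _ in range(500):
    reward_grad = 1.0                               # d(reward)/d(theta), toy model
    cost = max(theta, 0.0) + rng.normal(scale=0.1)  # noisy cost, grows with theta
    cost_grad = 1.0 if theta > 0 else 0.0
    theta += 0.05 * (reward_grad - lam * cost_grad)  # ascend the Lagrangian in theta
    lam = max(0.0, lam + 0.05 * (cost - budget))     # multiplier rises when over budget

print(f"theta={theta:.2f}  lambda={lam:.2f}")  # theta is pushed toward the budget
```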
  • safe-adaptation-agents

    An implementation of adaptive constrained RL algorithms; a child repository of @lasgroup/safe-adaptation-gym

    Language: Python · 1 star