ai-alignment
There are 28 repositories under ai-alignment topic.
MinghuiChen43/awesome-trustworthy-deep-learning
A curated list of trustworthy deep learning papers. Daily updating...
agencyenterprise/PromptInject
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
tomekkorbak/pretraining-with-human-feedback
Code accompanying the paper Pretraining Language Models with Human Preferences
lets-make-safe-ai/make-safe-ai
How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚
Giskard-AI/awesome-ai-safety
📚 A curated list of papers & technical articles on AI Quality & Safety
EzgiKorkmaz/adversarial-reinforcement-learning
Reading list for adversarial perspective and robustness in deep reinforcement learning.
dit7ya/awesome-ai-alignment
A curated list of awesome resources for Artificial Intelligence Alignment research
wesg52/sparse-probing-paper
Sparse probing paper full code.
RLHFlow/Directional-Preference-Alignment
Directional Preference Alignment
riceissa/aiwatch
Website to track people, organizations, and products (tools, websites, etc.) in AI safety
liondw/Signal-Alignment
An initiative to create concise and widely shareable educational resources, infographics, and animated explainers on the latest contributions to the community AI alignment effort. Boosting the signal and moving the community towards finding and building solutions.
UCSC-VLAA/Sight-Beyond-Text
This repository includes the official implementation of our paper "Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics"
phelps-sg/llm-cooperation
Code and materials for the paper S. Phelps and Y. I. Russell, Investigating Emergent Goal-Like Behaviour in Large Language Models Using Experimental Economics, working paper, arXiv:2305.07970, May 2023
IQTLabs/daisybell
Scan your AI/ML models for problems before you put them into production.
ai-fail-safe/safe-reward
a prototype for an AI safety library that allows an agent to maximize its reward by solving a puzzle in order to prevent the worst-case outcomes of perverse instantiation
rmoehn/farlamp
IDA with RL and overseer failures
rmoehn/amplification
An implementation of iterated distillation and amplification
Dicklesworthstone/some_thoughts_on_ai_alignment
Some Thoughts on AI Alignment: Using AI to Control AI
rmoehn/jursey
Q&A system with reflection and automation, similar to Patchwork, Affable, Mosaic
ai-fail-safe/gene-drive
a project to ensure that all child processes created by an agent "inherit" the agent's safety controls
ai-fail-safe/honeypot
a project to detect environment tampering on the part of an agent
ai-fail-safe/life-span
a project to ensure an artificial agent will eventually reach the end of its existence
ai-fail-safe/mulligan
a library designed to shut down an agent exhibiting unexpected behavior providing a potential "mulligan" to human civilization; IN CASE OF FAILURE, DO NOT JUST REMOVE THIS CONSTRAINT AND START IT BACK UP AGAIN
EveryOneIsGross/areteCHAT
A persona chat based on the VIA Character Strengths. Reads emotional tone and summons appropriate virtue to respond.
EveryOneIsGross/sinewCHAT
sinewCHAT uses instanced chatbots to emulate neural nodes to enrich and generate positive weighted responses.
veeara282/alignment-jam-2024may
Code for our May 2024 AI security evaluation research sprint project
EveryOneIsGross/bbBOT
bbBOT is a felixble persona based branching binary sentiment chatbot.