ai-alignment

There are 45 repositories under ai-alignment topic.

emcie-co/parlant
LLM agents built for control. Designed for real-world use. Deployed in minutes.
Language:Python12.1k 84 97972
agencyenterprise/PromptInject
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
Language:Python416 11 235
MinghuiChen43/awesome-trustworthy-deep-learning
A curated list of trustworthy deep learning papers. Daily updating...
373 12 035
Giskard-AI/awesome-ai-safety
📚 A curated list of papers & technical articles on AI Quality & Safety
192 3 021
tomekkorbak/pretraining-with-human-feedback
Code accompanying the paper Pretraining Language Models with Human Preferences
Language:Python180 5 814
lets-make-safe-ai/make-safe-ai
How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚
170 2 17
tsinghua-fib-lab/AAAI2025_MIA-Tuner
[AAAI'25 Oral] "MIA-Tuner: Adapting Large Language Models as Pre-training Text Detector".
Language:Python142 8 17
EzgiKorkmaz/adversarial-reinforcement-learning
Reading list for adversarial perspective and robustness in deep reinforcement learning.
121 6 06
AthenaCore/AwesomeResponsibleAI
A curated list of awesome academic research, books, code of ethics, data sets, institutes, maturity models, newsletters, principles, podcasts, reports, tools, regulations and standards related to Responsible, Trustworthy, and Human-Centered AI.
84 4 013
dit7ya/awesome-ai-alignment
A curated list of awesome resources for Artificial Intelligence Alignment research
71 4 011
RLHFlow/Directional-Preference-Alignment
Directional Preference Alignment
56 3 43
wesg52/sparse-probing-paper
Sparse probing paper full code.
Language:Jupyter Notebook56 2 211
riceissa/aiwatch
Website to track people, organizations, and products (tools, websites, etc.) in AI safety
Language:HTML23 3 16811
UCSC-VLAA/Sight-Beyond-Text
[TMLR 2024] Official implementation of "Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics"
Language:Python19 2 11
liondw/Signal-Alignment
An initiative to create concise and widely shareable educational resources, infographics, and animated explainers on the latest contributions to the community AI alignment effort. Boosting the signal and moving the community towards finding and building solutions.
18 2 50
lzzcd001/nabla-gfn
Official Implementation of Nabla-GFlowNet
13 2 00
phelps-sg/llm-cooperation
Code and materials for the paper S. Phelps and Y. I. Russell, Investigating Emergent Goal-Like Behaviour in Large Language Models Using Experimental Economics, working paper, arXiv:2305.07970, May 2023
Language:Python12 3 13
IQTLabs/daisybell
Scan your AI/ML models for problems before you put them into production.
Language:Python11 3 47
patcon/awesome-polis
Community list of awesome projects, apps, tools and more related to Polis.
Language:JavaScript9 1 122
ai-fail-safe/safe-reward
a prototype for an AI safety library that allows an agent to maximize its reward by solving a puzzle in order to prevent the worst-case outcomes of perverse instantiation
Language:Python8 1 30
rmoehn/farlamp
IDA with RL and overseer failures
Language:TeX8 2 00
Dicklesworthstone/some_thoughts_on_ai_alignment
Some Thoughts on AI Alignment: Using AI to Control AI
7 1 0
rmoehn/amplification
An implementation of iterated distillation and amplification
Language:Python7 0 02
rmoehn/jursey
Q&A system with reflection and automation, similar to Patchwork, Affable, Mosaic
Language:Clojure3 1 00
bfioca/prism-demo
PRISM: A Multi-Perspective AI Alignment Framework for Ethical AI (Demo: https://app.prismframework.ai | Paper: https://arxiv.org/abs/2503.04740)
Language:TypeScript2 1 00
lennox55555/Legal-BERT-RLHF
This web app is part of a research project to identify and address biases in the LegalBERT model for classifying legislative bills. Using explainability techniques, we aim to make model predictions transparent, revealing inherent biases and refining the model to be more human-aligned and fair for diverse communities.
Language:Jupyter Notebook2 1 00
levitation-opensource/bioblue
Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLM-s with simplified observation format. The benchmark themes include multi-objective homeostasis, (multi-objective) diminishing returns, complementary goods, sustainability, multi-agent resource sharing.
Language:Python2 1 02
RamyaLab/pluralistic-alignment
The open-source repository for PAL: Sample-Efficient Personalized Reward Modeling for Pluralistic Alignment.
Language:Python2 4 00
ai-fail-safe/gene-drive
a project to ensure that all child processes created by an agent "inherit" the agent's safety controls
1 1 00
ai-fail-safe/honeypot
a project to detect environment tampering on the part of an agent
1 1 00
ai-fail-safe/life-span
a project to ensure an artificial agent will eventually reach the end of its existence
1 1 00
ai-fail-safe/mulligan
a library designed to shut down an agent exhibiting unexpected behavior providing a potential "mulligan" to human civilization; IN CASE OF FAILURE, DO NOT JUST REMOVE THIS CONSTRAINT AND START IT BACK UP AGAIN
1 1 00
EveryOneIsGross/areteCHAT
A persona chat based on the VIA Character Strengths. Reads emotional tone and summons appropriate virtue to respond.
Language:Python1 1 00
EveryOneIsGross/sinewCHAT
sinewCHAT uses instanced chatbots to emulate neural nodes to enrich and generate positive weighted responses.
Language:Python1 1 00
KindYAK/serpent-llm-game
Force an LLM agent to eat the forbidden fruit
Language:Python00
SylvesterDuah/The_Guardian_of_AI_Alignment
This project is about AI Alignment where I is source data from history of AI incidents occurred and learn about it to provide a solution to mitigate any future occurrences again
Language:Python0 1 00

ai-alignment

emcie-co/parlant

agencyenterprise/PromptInject

MinghuiChen43/awesome-trustworthy-deep-learning

Giskard-AI/awesome-ai-safety

tomekkorbak/pretraining-with-human-feedback

lets-make-safe-ai/make-safe-ai

tsinghua-fib-lab/AAAI2025_MIA-Tuner

EzgiKorkmaz/adversarial-reinforcement-learning

AthenaCore/AwesomeResponsibleAI

dit7ya/awesome-ai-alignment

RLHFlow/Directional-Preference-Alignment

wesg52/sparse-probing-paper

riceissa/aiwatch

UCSC-VLAA/Sight-Beyond-Text

liondw/Signal-Alignment

lzzcd001/nabla-gfn

phelps-sg/llm-cooperation

IQTLabs/daisybell

patcon/awesome-polis

ai-fail-safe/safe-reward

rmoehn/farlamp

Dicklesworthstone/some_thoughts_on_ai_alignment

rmoehn/amplification

rmoehn/jursey

bfioca/prism-demo

lennox55555/Legal-BERT-RLHF

levitation-opensource/bioblue

RamyaLab/pluralistic-alignment

ai-fail-safe/gene-drive

ai-fail-safe/honeypot

ai-fail-safe/life-span

ai-fail-safe/mulligan

EveryOneIsGross/areteCHAT

EveryOneIsGross/sinewCHAT

KindYAK/serpent-llm-game

SylvesterDuah/The_Guardian_of_AI_Alignment