ai-safety
There are 119 repositories under the ai-safety topic.
Giskard-AI/giskard
🐢 Open-Source Evaluation & Testing for AI & LLM systems
jphall663/awesome-machine-learning-interpretability
A curated list of awesome responsible machine learning resources.
PKU-Alignment/safe-rlhf
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
JohnSnowLabs/langtest
Deliver safe & effective language models
tigerlab-ai/tiger
Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)
agencyenterprise/PromptInject
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
hendrycks/ethics
Aligning AI With Shared Human Values (ICLR 2021)
ShengranHu/Thought-Cloning
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
normster/llm_rules
RuLES: a benchmark for evaluating rule-following in language models
tomekkorbak/pretraining-with-human-feedback
Code accompanying the paper Pretraining Language Models with Human Preferences
lets-make-safe-ai/make-safe-ai
How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚
Giskard-AI/awesome-ai-safety
📚 A curated list of papers & technical articles on AI Quality & Safety
WindVChen/DiffAttack
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
phantasmlabs/phantasm
Toolkits to create a human-in-the-loop approval layer that monitors and guides AI agent workflows in real time.
PKU-YuanGroup/Hallucination-Attack
An attack that induces hallucinations in LLMs
ryoungj/ToolEmu
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
PKU-Alignment/beavertails
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
LetterLiGo/SafeGen_CCS2024
[CCS'24] SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models
EzgiKorkmaz/adversarial-reinforcement-learning
Reading list on adversarial perspectives and robustness in deep reinforcement learning.
microsoft/SafeNLP
Safety Score for Pre-Trained Language Models
cvs-health/langfair
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
SafeAILab/RAIN
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
megvii-research/FSSD_OoD_Detection
[SafeAI'21] Feature Space Singularity for Out-of-Distribution Detection.
dlmacedo/entropic-out-of-distribution-detection
A project to add scalable state-of-the-art out-of-distribution detection (open set recognition) support by changing two lines of code! Perform efficient inferences (i.e., do not increase inference time) and detection without classification accuracy drop, hyperparameter tuning, or collecting additional data.
ai4ce/FLAT
[ICCV2021 Oral] Fooling LiDAR by Attacking GPS Trajectory
dit7ya/awesome-ai-alignment
A curated list of awesome resources for Artificial Intelligence Alignment research
AthenaCore/AwesomeResponsibleAI
A curated list of awesome academic research, books, code of ethics, data sets, institutes, newsletters, principles, podcasts, reports, tools, regulations and standards related to Responsible, Trustworthy, and Human-Centered AI.
wesg52/sparse-probing-paper
Sparse probing paper full code.
dlmacedo/distinction-maximization-loss
A project to improve out-of-distribution detection (open set recognition) and uncertainty estimation by changing a few lines of code in your project! Perform efficient inferences (i.e., do not increase inference time) without repetitive model training, hyperparameter tuning, or collecting additional data.
StampyAI/stampy-ui
AI Safety Q&A web frontend
yardenas/la-mbda
LAMBDA is a model-based reinforcement learning agent that uses Bayesian world models for safe policy optimization
erfanshayegani/Jailbreak-In-Pieces
[ICLR 2024 Spotlight 🔥] [Best Paper Award SoCal NLP 2023 🏆] Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
tamlhp/awesome-privex
Awesome PrivEx: Privacy-Preserving Explainable AI (PPXAI)
ongov/AI-Principles
Alpha principles for the ethical use of AI and data-driven technologies in Ontario
wesg52/universal-neurons
Universal Neurons in GPT2 Language Models
riceissa/aiwatch
Website to track people, organizations, and products (tools, websites, etc.) in AI safety