ai-safety
There are 119 repositories under the ai-safety topic.
Giskard-AI/giskard
🐢 Open-Source Evaluation & Testing for AI & LLM systems
jphall663/awesome-machine-learning-interpretability
A curated list of awesome responsible machine learning resources.
PKU-Alignment/safe-rlhf
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
JohnSnowLabs/langtest
Deliver safe & effective language models
tigerlab-ai/tiger
Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)
agencyenterprise/PromptInject
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
hendrycks/ethics
Aligning AI With Shared Human Values (ICLR 2021)
ShengranHu/Thought-Cloning
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
normster/llm_rules
RuLES: a benchmark for evaluating rule-following in language models
tomekkorbak/pretraining-with-human-feedback
Code accompanying the paper Pretraining Language Models with Human Preferences
lets-make-safe-ai/make-safe-ai
How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚
Giskard-AI/awesome-ai-safety
📚 A curated list of papers & technical articles on AI Quality & Safety
WindVChen/DiffAttack
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
phantasmlabs/phantasm
Toolkits to create a human-in-the-loop approval layer that monitors and guides AI agent workflows in real time.
PKU-YuanGroup/Hallucination-Attack
An attack that induces hallucinations in LLMs
ryoungj/ToolEmu
[ICLR'24 Spotlight] A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
PKU-Alignment/beavertails
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
LetterLiGo/SafeGen_CCS2024
[CCS'24] SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models
EzgiKorkmaz/adversarial-reinforcement-learning
Reading list on adversarial perspectives and robustness in deep reinforcement learning.
microsoft/SafeNLP
Safety Score for Pre-Trained Language Models
cvs-health/langfair
LangFair is a Python library for conducting use-case level LLM bias and fairness assessments
SafeAILab/RAIN
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
megvii-research/FSSD_OoD_Detection
[SafeAI'21] Feature Space Singularity for Out-of-Distribution Detection.
dlmacedo/entropic-out-of-distribution-detection
A project to add scalable state-of-the-art out-of-distribution detection (open set recognition) support by changing two lines of code! Perform efficient inferences (i.e., do not increase inference time) and detection without classification accuracy drop, hyperparameter tuning, or collecting additional data.
ai4ce/FLAT
[ICCV2021 Oral] Fooling LiDAR by Attacking GPS Trajectory
dit7ya/awesome-ai-alignment
A curated list of awesome resources for Artificial Intelligence Alignment research
AthenaCore/AwesomeResponsibleAI
A curated list of awesome academic research, books, code of ethics, data sets, institutes, newsletters, principles, podcasts, reports, tools, regulations and standards related to Responsible, Trustworthy, and Human-Centered AI.
wesg52/sparse-probing-paper
Sparse probing paper full code.
dlmacedo/distinction-maximization-loss
A project to improve out-of-distribution detection (open set recognition) and uncertainty estimation by changing a few lines of code in your project! Perform efficient inferences (i.e., do not increase inference time) without repetitive model training, hyperparameter tuning, or collecting additional data.
StampyAI/stampy-ui
AI Safety Q&A web frontend
yardenas/la-mbda
LAMBDA is a model-based reinforcement learning agent that uses Bayesian world models for safe policy optimization
erfanshayegani/Jailbreak-In-Pieces
[ICLR 2024 Spotlight 🔥] [Best Paper Award SoCal NLP 2023 🏆] Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
tamlhp/awesome-privex
Awesome PrivEx: Privacy-Preserving Explainable AI (PPXAI)
ongov/AI-Principles
Alpha principles for the ethical use of AI and data-driven technologies in Ontario
wesg52/universal-neurons
Universal Neurons in GPT2 Language Models
riceissa/aiwatch
Website to track people, organizations, and products (tools, websites, etc.) in AI safety