ai-safety
There are 87 repositories under ai-safety topic.
jphall663/awesome-machine-learning-interpretability
A curated list of awesome responsible machine learning resources.
Giskard-AI/giskard
🐢 Open-Source Evaluation & Testing for LLMs and ML models
PKU-Alignment/safe-rlhf
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
tigerlab-ai/tiger
Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)
agencyenterprise/PromptInject
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
ShengranHu/Thought-Cloning
[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
hendrycks/ethics
Aligning AI With Shared Human Values (ICLR 2021)
normster/llm_rules
RuLES: a benchmark for evaluating rule-following in language models
lets-make-safe-ai/make-safe-ai
How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚
tomekkorbak/pretraining-with-human-feedback
Code accompanying the paper Pretraining Language Models with Human Preferences
Giskard-AI/awesome-ai-safety
📚 A curated list of papers & technical articles on AI Quality & Safety
WindVChen/DiffAttack
An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
microsoft/SafeNLP
Safety Score for Pre-Trained Language Models
ryoungj/ToolEmu
A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use
PKU-YuanGroup/Hallucination-Attack
Attack to induce LLMs within hallucinations
megvii-research/FSSD_OoD_Detection
Feature Space Singularity for Out-of-Distribution Detection. (SafeAI 2021)
PKU-Alignment/beavertails
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
EzgiKorkmaz/adversarial-reinforcement-learning
Reading list for adversarial perspective and robustness in deep reinforcement learning.
dlmacedo/entropic-out-of-distribution-detection
A project to add scalable state-of-the-art out-of-distribution detection (open set recognition) support by changing two lines of code! Perform efficient inferences (i.e., do not increase inference time) and detection without classification accuracy drop, hyperparameter tuning, or collecting additional data.
SafeAILab/RAIN
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
ai4ce/FLAT
[ICCV2021 Oral] Fooling LiDAR by Attacking GPS Trajectory
dit7ya/awesome-ai-alignment
A curated list of awesome resources for getting-started-with and staying-in-touch-with Artificial Intelligence Alignment research.
dlmacedo/distinction-maximization-loss
A project to improve out-of-distribution detection (open set recognition) and uncertainty estimation by changing a few lines of code in your project! Perform efficient inferences (i.e., do not increase inference time) without repetitive model training, hyperparameter tuning, or collecting additional data.
wesg52/sparse-probing-paper
Sparse probing paper full code.
StampyAI/stampy-ui
AI Safety Q&A web frontend
yardenas/la-mbda
LAMBDA is a model-based reinforcement learning agent that uses Bayesian world models for safe policy optimization
ongov/AI-Principles
Alpha principles for the ethical use of AI and Data Driven Technologies in Ontario | Proposition de principes pour une utilisation éthique des technologies axées sur les données en Ontario
riceissa/aiwatch
Website to track people, organizations, and products (tools, websites, etc.) in AI safety
wesg52/universal-neurons
Universal Neurons in GPT2 Language Models
cure-lab/ContraNet
This is the official implementation of ContraNet (NDSS2022).
lancopku/Avg-Avg
[Findings of EMNLP 2022] Holistic Sentence Embeddings for Better Out-of-Distribution Detection
tamlhp/awesome-privex
Awesome PrivEx: Privacy-Preserving Explainable AI (PPXAI)
IQTLabs/daisybell
Scan your AI/ML models for problems before you put them into production.
Jakobovski/ai-safety-cheatsheet
A compilation of AI safety ideas, problems, and solutions.
PAIR-code/farsight
In situ interactive widgets for responsible AI 🌱
jehumtine/LAWLIA
LAWLIA is an open-source computational legal framework designed to revolutionize legal reasoning and analysis. It combines the power of large language models with a structured syntactical grammar to facilitate precise legal assessments, truth values, and verdicts. LAWLIA is the future of computational jurisprudence