safety-alignment

There are 3 repositories under the safety-alignment topic.

ferret
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique
Language: Python · 11
holistic_automated_red_teaming
[EMNLP 2024] Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction
Language: Python · 2
MOSSBench
This is the official implementation (code, data) of the paper "MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?"
Language: JavaScript · 1