safety-alignment

There are 3 repositories under the safety-alignment topic.

  • ferret

    Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique

    Language: Python · Stars: 11
  • holistic_automated_red_teaming

    [EMNLP 2024] Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction

    Language: Python · Stars: 2
  • MOSSBench

    This is the official implementation (code, data) of the paper "MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?"

    Language: JavaScript · Stars: 1