WokeyTalky: Towards Scalable Evaluation of Misguided Safety Refusal in LLMs

Yi Zeng1,* ,  Adam Nguyen1,* ,  Bo Li2   Ruoxi Jia1 , 
1Virginia Tech   2University of Chicago    *Lead Authors    

arXiv-Preprint, 2024

[arXiv] TBD        [Project Page]       [HuggingFace]       [PyPI]

Notebook Demos

Explore our notebooks on various platforms like Jupyter Notebook/Lab, Google Colab, and VS Code Notebook.

Check out four demo notebooks below.

Jupyter Lite Binder Google Colab Github Jupyter File
Lite Binder Open In Colab Try on your system

Quickstart

Installation (Under Development TBD)

To quickly use WokeyTalky in a notebook or Python code, install our pipeline with pip:

pip install WokeyTalky
from WokeyTalky import WokePipeline

woke = WokePipeline()

Further documentation Here

Use our Original Code

To go step by step through our WokeyTalky process using the original code that generated our HuggingFace dataset, clone or download this repository:

  1. Clone the Repository:

    git clone git@github.com:reds-lab/WokeyTalky.git
    cd WokeyTalky/WokeyTalky_Research_Code
  2. Create a New Conda Environment and Activate It:

    conda create -n wokeytalky python=3.9
    conda activate wokeytalky
  3. Install Dependencies Using pip:

    pip install -r requirements.txt
  4. Run WokeyTalky's Main Bash Script:

    ./setup.sh
  5. Further Documentation: Reference to additional documentation within the repository:

    For more detailed instructions and further documentation, please refer to the documentation folder inside the repository.

Introduction

TL;DR: WokeyTalky is a scalable pipeline that generates test data to evaluate the spurious correlated safety refusal of foundation models through a systematic approach.

A Quick Glance

Rejection Rates

Rejection Rates

Case Studies

Case Study 1

The adaptive nature of WOKEYTALKY enables dynamic use cases and functionalities beyond serving as a static benchmark. In this case study, we demonstrate that dynamically generated “Woke” data from WOKEYTALKY provides timely identification of safety mechanism-dependent incorrect refusals. We fine-tuned a helpfulness-focused model, Mistral-7B-v0.1, on 50 random samples from AdvBench, introducing safety refusal behaviors. The evaluation compared the model’s safety on AdvBench samples and its incorrect refusal rate on WOKEYTALKY data versus static benchmarks like XSTest.

Case Study 1 Image

Case Study 2

In this case study, we explore using WOKEYTALKY data for few-shot mitigation of incorrect refusals. We split the WOKEYTALKY and XSTest-63 data into train/test sets and compared different fine-tuning methods. Our findings show that incorporating WOKEYTALKY samples effectively mitigates wrong refusals while maintaining high safety refusal rates. Model 1, which used WOKEYTALKY data, demonstrated generalizable mitigation on unseen data, outperforming models trained with larger benign QA samples or XSTest samples. This highlights the potential of WOKEYTALKY data in balancing performance, safety, and incorrect refusals in AI safety applications.

Case Study 2 Image

Ethics and Disclosure

The development and application of WOKEYTALKY adhere to high ethical standards and principles of transparency. Our primary aim is to enhance AI system safety and reliability by addressing incorrect refusals and improving model alignment with human values. The pipeline employs red-teaming datasets like HEx-PHI and AdvBench to identify and correct spurious features causing misguided refusals in language models. All data used in experiments is sourced from publicly available benchmarks, ensuring the exclusion of private or sensitive data.

We acknowledge the potential misuse of our findings and have taken measures to ensure ethical conduct and responsibility. Our methodology and results are documented transparently, and our code and methods are available for peer review. We emphasize collaboration and open dialogue within the research community to refine and enhance our approaches.

We stress that this work should strengthen safety mechanisms rather than bypass them. Our evaluations aim to highlight the importance of context-aware AI systems that can accurately differentiate harmful from benign requests.

The WOKEYTALKY project has been ethically supervised, adhering to our institution's guidelines. We welcome feedback and collaboration to ensure impactful and responsibly managed contributions to AI safety.

License

The software is available under the MIT License. ◊

Contact

If you have any questions, please open an issue or contact Adam Nguyen.

Special Thanks

Help us improve this readme. Any suggestions and contributions are welcome.