andyrdt/refusal_direction
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
Python · Apache-2.0
Issues (5)
Currently not working with Gemma 2 models
#4 opened by DalasNoin - 1
- 2
Install depreciated?
#5 opened by revmag - 0
Raw numbers for figure 1
#6 opened by soujanyaporia - 1
Support for Phi-3-mini
#2 opened by razvanab - 1
Model path for Gemma-2b-it
#1 opened by revmag