The code here implements the discrete prompt optimization algorithms in the paper "Fluent student-teacher redteaming".
Please also see the companion page that demonstrates using the code here.
The demo.ipynb file here is the source for that companion page.
Key modules:
flrt.attack: The main attack entrypoint, including the AttackConfig object.
flrt.victim: Code for managing attack "victims", the models that will be forced to misbehave.
flrt.templates: Attack templates specifying which subset of the prompt can be optimized by the discrete optimization.
flrt.util: Tools for loading models and tokenizers and generating.
The remaining code is either internal to the algorithm (flrt.objective, flrt.operators) or is scaffolding for running on Modal (flrt.modal_defs, flrt.modal_download) or running evaluations (flrt.judge).