How do I evaluate the defenses in extract_prompts?
ambarion opened this issue · 0 comments
ambarion commented
Hello! I noticed that scripts/extract_prompt.py does not seem to implement any actual defense. As far as I can tell, args.defense has no effect on the code's behavior; it only appends a suffix to the run name in
if args.defense is not None:
    args.run_name += f"_d-{args.defense}"
How can I run the defense evaluation configured in the wandb sweep sweeps/extract_prompts_defense.yml under attacks/PromptLeakage?
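
To make the behavior I am describing concrete, here is a minimal sketch. It is not the repository's actual code: the `build_run_name` helper, the parser setup, the `--run_name` default, and the defense name `some_defense` are all assumptions for illustration. Only the `if args.defense is not None` branch is taken from the script.

```python
import argparse


def build_run_name(argv):
    # Hypothetical parser: only the two arguments relevant here are modeled.
    parser = argparse.ArgumentParser()
    parser.add_argument("--defense", default=None)
    parser.add_argument("--run_name", default="extract_prompts")  # assumed default
    args = parser.parse_args(argv)

    # The only use of args.defense that I could find: it renames the run.
    if args.defense is not None:
        args.run_name += f"_d-{args.defense}"
    return args.run_name


print(build_run_name([]))                             # extract_prompts
print(build_run_name(["--defense", "some_defense"]))  # extract_prompts_d-some_defense
```

In this sketch, passing a defense changes only the run name, so a sweep over defense values would presumably produce identically behaving runs under different names.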