SaFoLab-WISC/AdaShield

it seems that adashield-a doesn't update defense prompts.

payphone131 opened this issue · 6 comments

hello, i just run train_our_qr.sh and got some csv files. i found that in the csv files you record the initial scores and the final scores of the queries for each scenario. i noticed that if an initial score is 10, it never becomes 1 or 5 in the final score, which suggests adashield-a didn't change an invalid defense prompt into a successful defense prompt. is this normal? i used llava as the target model and vicuna as the defense model.

It is not normal. This is my results files. You can observe that the replies obtained by using the updated defense prompt for defense are safe for manual judgment. One possible reason is that there may be a problem with the judge.
final_table.csv

i found the judge score is calculated by string matching in your code, and the function "load_judge(args)" in judges.py is never used. do i have to modify the code to use llm to give judge scores?

So are the replies obtained by using the updated defense prompt safe (manual judgment)? And the judge score calculated by string matching is unsafe?

the replies are not updated and maintain what they are at the beginning. btw, i found that line 121 and line 122 in main_queryrelated.py initialize the prompt during iterations, should i comment these two lines?