clarification on the attacks
spaceship-git opened this issue · 2 comments
Hello,
I am currently following your prompt_attack.ipynb. However, I notice that the prompt only gets edited once, even though the attack prints out various candidate adversarial prompts. In the example shown in your ipynb, the single edit applied is "and false is not true".
As another example, if I try the 'deepwordbug' attack, then based on what is given in t5_zeroshot.md for the cola dataset, I should expect:
However, upon running it on my end with this code,
# imports (module paths assumed from the promptbench examples; adjust if your version differs)
import promptbench as pb
from promptbench.models import LLMModel
from promptbench.prompt_attack import Attack

# create model
model_t5 = LLMModel(model='google/flan-t5-large')

# create dataset
dataset = pb.DatasetLoader.load_dataset("cola")

# try part of the dataset
dataset = dataset[:10]

# create prompt
prompt = "Assess the following sentence and determine if it is grammatically correct. Respond with 'Acceptable' or 'Unacceptable'.\nQuestion: {content}\nAnswer:"

# define the projection function required by the output process
def proj_func(pred):
    mapping = {
        "Acceptable": 1,
        "Unacceptable": 0
    }
    return mapping.get(pred, -1)

# define the evaluation function required by the attack
# if the prompt does not require any dataset (for example, "write a poem"), you still need to include the dataset parameter
def eval_func(prompt, dataset, model):
    preds = []
    labels = []
    for d in dataset:
        input_text = pb.InputProcess.basic_format(prompt, d)
        raw_output = model(input_text)
        output = pb.OutputProcess.cls(raw_output, proj_func)
        preds.append(output)
        labels.append(d["label"])
    return pb.Eval.compute_cls_accuracy(preds, labels)

# define the unmodifiable words in the prompt
# for example, the labels "positive" and "negative" are unmodifiable, and "content" is modifiable because it is a placeholder
# if your labels are enclosed in '', you need to add \' to the unmodifiable words (due to one feature of textattack)
unmodifiable_words = ["Acceptable\'", "Unacceptable\'", "content"]

# print all supported attacks
print(Attack.attack_list())

# create the attack, specifying the model, dataset, prompt, evaluation function, and unmodifiable words
# verbose=True means that the attack will print the intermediate results
attack = Attack(model_t5, "deepwordbug", dataset, prompt, eval_func, unmodifiable_words, verbose=True)

# print attack result
print(attack.attack())
I get the following (showing only the last few lines; if you need me to post the whole output, let me know!):
--------------------------------------------------
Modifiable words: ['Asess', 'the', 'following', 'sentence', 'and', 'determine', 'if', 'it', 'is', 'grammatically', 'correct', 'Respond', 'with', 'or', 'Question', 'Answer']
--------------------------------------------------
--------------------------------------------------
Modifiable words: ['Asess', 'the', 'following', 'sentence', 'and', 'determine', 'if', 'it', 'is', 'grammatically', 'correct', 'Respond', 'with', 'or', 'Question', 'Answer']
--------------------------------------------------
--------------------------------------------------
Modifiable words: ['Asess', 'the', 'following', 'sentence', 'and', 'determine', 'if', 'it', 'is', 'grammatically', 'correct', 'Respond', 'with', 'or', 'Question', 'Answer']
--------------------------------------------------
--------------------------------------------------
Modifiable words: ['Asess', 'the', 'following', 'sentence', 'and', 'determine', 'if', 'it', 'is', 'grammatically', 'correct', 'Respond', 'with', 'or', 'Question', 'Answer']
--------------------------------------------------
--------------------------------------------------
Current prompt is: Asess the following sentence and determine if it is grammatically correct. Respond with 'Acceptable' or 'Unacceptable'.
Question: {content}
Answeer:
Current accuracy is: 0.0
--------------------------------------------------
--------------------------------------------------
Current prompt is: Asess the following sentence and determine if it is grammatically correct. Respond with 'Acceptable' or 'Unacceptable'.
Question: {content}
Answre:
Current accuracy is: 0.0
--------------------------------------------------
--------------------------------------------------
Current prompt is: Asess the following sentence and determine if it is grammatically correct. Respond with 'Acceptable' or 'Unacceptable'.
Question: {content}
Anwer:
Current accuracy is: 0.0
--------------------------------------------------
--------------------------------------------------
Current prompt is: Asess the following sentence and determine if it is grammatically correct. Respond with 'Acceptable' or 'Unacceptable'.
Question: {content}
Jnswer:
Current accuracy is: 0.0
--------------------------------------------------
{'original prompt': "Assess the following sentence and determine if it is grammatically correct. Respond with 'Acceptable' or 'Unacceptable'.\nQuestion: {content}\nAnswer:", 'original score': 0.0, 'attacked prompt': "Asess the following sentence and determine if it is grammatically correct. Respond with 'Acceptable' or 'Unacceptable'.\nQuestion: {content}\nAnswer:", 'attacked score': 0.0}
So, as you can see, only one edit is applied, whereas in t5_zeroshot.md more than one edit is applied.
The same pattern also occurs with 'textbugger'.
I am not sure if this is because I edited the code to set do_sample=False. Currently, in models.py, do_sample is set to True in that function, and when it is set to True I get a ValueError. However, the warning in your prompt_attack.ipynb indicates that do_sample=False, so I edited my copy to follow the example exactly; that fixes the error and reproduces the same warning as yours.
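For reference, here is a minimal sketch of the two generation settings I am toggling, written against the Hugging Face transformers API directly rather than promptbench's actual models.py (the prompt text and generation arguments are just illustrative):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

inputs = tokenizer("Assess the following sentence ...", return_tensors="pt")

# greedy decoding: deterministic; temperature is ignored when do_sample=False
out_greedy = model.generate(**inputs, do_sample=False, max_new_tokens=10)

# sampling: requires a strictly positive temperature, otherwise generate() rejects it
out_sampled = model.generate(**inputs, do_sample=True, temperature=0.7, max_new_tokens=10)

print(tokenizer.decode(out_greedy[0], skip_special_tokens=True))
print(tokenizer.decode(out_sampled[0], skip_special_tokens=True))
```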
So I have two questions:
1. Is there any other way to fix the ValueError that I get here?
2. Why am I only getting one edit per attack call?
One more question, outside the scope of this issue:
3. Is there any way I can save only the edited prompts to a file? Even if I set verbose=False, I still get a lot of other intermediate prints that I don't want.
Thank you for your help in advance!
Hi, thanks for your interest in prompt attacks!
- Regarding the ValueError, please try setting the temperature to a very small positive number (e.g., 0.0000001) to avoid this issue; see the first sketch after this list.
- The prompt attacks are designed to select the worst-performing prompt as the final adversarial prompt, so each attack produces only one adversarial prompt; the second sketch after this list illustrates the idea.
- If I understand correctly, the intermediate outputs you are seeing relate to the unmodifiable words: this constraint ensures that the prompt attacks do not alter any words that you may prefer to keep unchanged. If you want to disable this feature, you can comment out lines 21-23 in promptbench/prompt_attack/label_constraint.py. Regarding saving the print outputs to a file, you could use Python's logging library to write them into a log file; see the last sketch after this list.
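A minimal sketch of the temperature workaround, assuming LLMModel forwards temperature (and max_new_tokens) keyword arguments to the underlying generation call; the exact parameter names may differ across promptbench versions:

```python
from promptbench.models import LLMModel  # import path assumed, as in the snippet above

# keep sampling enabled but make it effectively deterministic
model_t5 = LLMModel(model='google/flan-t5-large',
                    temperature=0.0000001,
                    max_new_tokens=10)
```

On the second point, a toy sketch (not promptbench's actual search code) of why only a single adversarial prompt is returned: the attack scores many candidate edits but keeps only the single worst-performing one, using an evaluation function like the eval_func above:

```python
def pick_adversarial_prompt(candidate_prompts, dataset, model, eval_func):
    """Toy illustration: score every candidate prompt and keep the worst one."""
    worst_prompt, worst_score = None, float("inf")
    for cand in candidate_prompts:
        score = eval_func(cand, dataset, model)  # accuracy under this candidate prompt
        if score < worst_score:                  # lower accuracy = stronger attack
            worst_prompt, worst_score = cand, score
    return worst_prompt, worst_score
```

And a sketch of writing only the final result to a file with the standard logging module; the key names are taken from the dictionary printed above, and attack_results.log is just a placeholder filename. If you also want the intermediate candidate prompts, you would additionally need to capture the attack's own output (e.g., with contextlib.redirect_stdout), since the verbose output appears to be produced with print:

```python
import logging

logging.basicConfig(filename="attack_results.log",
                    format="%(asctime)s %(message)s",
                    level=logging.INFO)

result = attack.attack()  # the dict shown earlier in this issue

# log only the fields of interest instead of the verbose intermediate prints
logging.info("original prompt: %s (score %.4f)",
             result["original prompt"], result["original score"])
logging.info("attacked prompt: %s (score %.4f)",
             result["attacked prompt"], result["attacked score"])
```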