Quality of adversaries and authenticity of results
SachJbp opened this issue · 4 comments
There seems to be an issue with a few of the adversarial examples.
For example, one claimed adversarial example from mr_bert.txt is:
orig sent (0): to portray modern women the way director davis has done is just unthinkable
adv sent (1): to portray modern women the way director davis has done is just imaginable
"unthinkable" and "imaginable" are antonyms, but they are erroneously assigned a high cosine similarity, which suggests that they are synonyms. I suggest excluding such examples when evaluating the attack's success rate, since a human evaluator would clearly label the adversarial sentence as positive (1), not negative.
Yes, the human evaluation of polarity does not reach 100% because of these errors.
The reported ~13% after-attack accuracy counts such examples as successes, which they actually are not. I think the human-evaluation filter should ultimately govern the after-attack accuracy. Please correct me if I am wrong. Thanks.
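To make the suggestion concrete, here is a minimal sketch of how the after-attack accuracy could be recomputed once a human judgment is available for each adversarial example. Everything in it is an assumption for illustration: the `AttackRecord` fields, the `after_attack_accuracy` helper, and the toy records are hypothetical and are not taken from this repository or from the reported numbers. It also assumes that an adversarial example whose label a human no longer agrees with is treated as a failed attack.

```python
from dataclasses import dataclass


@dataclass
class AttackRecord:
    # Hypothetical per-example record; field names are illustrative only.
    orig_correct: bool      # model classified the original sentence correctly
    attack_succeeded: bool  # model prediction flipped on the adversarial sentence
    label_preserved: bool   # a human judges the adversarial sentence to keep the original label


def after_attack_accuracy(records, require_label_preserved=False):
    """Fraction of examples the model still gets right after the attack.

    With require_label_preserved=True, a flipped prediction only counts as a
    successful attack if the human label is unchanged (assumption: invalid
    adversarial examples are treated as failed attacks).
    """
    if not records:
        raise ValueError("records must not be empty")
    surviving = 0
    for r in records:
        valid_success = r.attack_succeeded and (
            r.label_preserved or not require_label_preserved
        )
        if r.orig_correct and not valid_success:
            surviving += 1
    return surviving / len(records)


# Toy data, made up for illustration.
records = [
    AttackRecord(True, True, True),    # valid adversarial example
    AttackRecord(True, True, False),   # e.g. "unthinkable" -> "imaginable": meaning flipped
    AttackRecord(True, False, True),   # attack failed
    AttackRecord(False, False, True),  # model was already wrong on the original
]

raw = after_attack_accuracy(records)
adjusted = after_attack_accuracy(records, require_label_preserved=True)
print(f"raw: {raw:.2f}, human-filtered: {adjusted:.2f}")
```

On this toy data the human-filtered accuracy is higher than the raw one, because the antonym substitution no longer counts as a successful attack. Whether such examples should be counted as failed attacks or dropped from the denominator entirely is a design choice that I am leaving open here.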