Quality of adversaries and authenticity of results

Question


SachJbp opened this issue 5 years ago · 4 comments

There seems to be an issue with a few of the adversarial examples.

For example, one claimed adversarial example from mr_bert.txt is:
orig sent (0): to portray modern women the way director davis has done is just unthinkable
adv sent (1): to portray modern women the way director davis has done is just imaginable

"unthinkable" and "imaginable" are antonyms, yet their embeddings have a high cosine similarity, so the attack erroneously treats them as synonyms. I suggest excluding such examples when computing the attack success rate: a human evaluator would clearly label the adversarial sentence as positive (1), not negative, so the model's changed prediction is not actually a mistake.
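To illustrate why a cosine-similarity threshold alone cannot catch this, here is a minimal sketch of a synonym-candidate filter that additionally rejects known antonym pairs. The function names, the `threshold` value, the toy 3-d vectors and the hand-written antonym set are all hypothetical and made up for illustration; they are not the repository's code or real embeddings. In practice an antonym list could come from a lexical resource such as WordNet, but that is an assumption, not something the attack is known to do.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_candidates(word, candidates, vectors, antonyms, threshold=0.5):
    """Keep candidates that are close in embedding space AND not listed as antonyms.

    `antonyms` is a hypothetical set of (word, word) pairs; order does not matter.
    """
    kept = []
    for cand in candidates:
        sim = cosine_similarity(vectors[word], vectors[cand])
        if sim < threshold:
            continue  # too dissimilar to be treated as a synonym
        if (word, cand) in antonyms or (cand, word) in antonyms:
            continue  # similar vectors, but opposite meaning
        kept.append(cand)
    return kept


# Toy vectors, invented for this example only (not real word embeddings).
vectors = {
    "unthinkable": [0.9, 0.1, 0.3],
    "imaginable": [0.8, 0.2, 0.35],
    "inconceivable": [0.85, 0.05, 0.3],
}
antonyms = {("unthinkable", "imaginable")}

print(filter_candidates("unthinkable", ["imaginable", "inconceivable"], vectors, antonyms))
```

With these toy vectors both candidates clear the similarity threshold, so only the antonym check removes "imaginable"; the printed list is `['inconceivable']`.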

Answer 1 · 2020-06-20T20:02:31.000Z

Yes, this is why the human evaluation of polarity is not 100%: errors like this one lower it.

Answer 2 · 2020-06-20T20:09:05.000Z

The reported ~13% after-attack accuracy counts such examples as successful attacks, which they are not. I think the human-evaluation filter should ultimately determine the after-attack accuracy. Please correct me if I am wrong. Thanks.
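A minimal sketch of how a human-evaluation filter could change the after-attack accuracy. The `Example` record, its field names and the `use_human_filter` flag are hypothetical. The sketch assumes after-attack accuracy is the fraction of all examples on which the model is still counted as correct, and it makes one particular choice: an attack whose perturbed sentence a human judges to have changed the true label is counted as a failed attack. Excluding such examples from the denominator instead would be an equally reasonable alternative.

```python
from dataclasses import dataclass


@dataclass
class Example:
    orig_correct: bool    # model was correct on the original sentence
    attack_flipped: bool  # attack changed the model's prediction
    human_valid: bool     # human judges that the perturbed sentence keeps the original label


def after_attack_accuracy(examples, use_human_filter=False):
    """Fraction of examples on which the model is still counted as correct after the attack."""
    if not examples:
        raise ValueError("need at least one example")
    still_correct = 0
    for ex in examples:
        if not ex.orig_correct:
            continue  # already wrong before the attack: counts as an error
        if not ex.attack_flipped:
            still_correct += 1  # attack failed outright
        elif use_human_filter and not ex.human_valid:
            still_correct += 1  # "success" rejected: the true label changed as well
    return still_correct / len(examples)


# Four toy records, invented for illustration.
examples = [
    Example(orig_correct=True, attack_flipped=False, human_valid=True),  # attack failed
    Example(orig_correct=True, attack_flipped=True, human_valid=True),   # genuine adversarial example
    Example(orig_correct=True, attack_flipped=True, human_valid=False),  # e.g. unthinkable -> imaginable
    Example(orig_correct=False, attack_flipped=False, human_valid=True), # wrong before the attack
]

print(after_attack_accuracy(examples))                         # 0.25
print(after_attack_accuracy(examples, use_human_filter=True))  # 0.5
```

Under this choice the filter can only raise the after-attack accuracy, since every rejected "success" becomes a case where the model is not counted as fooled.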

Answer 3 · 2020-06-20T20:14:54.000Z

Human evaluation can check whether these "successful" examples are legitimate or not.

Answer 4 · 2021-10-20T14:37:08.000Z

Where is the embedding.npz file, please? Or how is it generated?