thunlp/OpenBackdoor

dataset

clearloveclearlove opened this issue · 3 comments

When I perform a style attack on the OffensEval dataset, I find that the poisoned samples are formatted cleanly, while the original OffensEval data contains a large number of @user tokens. This seems unreasonable: the model is then likely to learn the absence of @user tokens, rather than the text style, as the trigger.
So I suggest preprocessing OffensEval before the experiment, for example by filtering out the @user tokens.
The other spam and toxic datasets have similar problems.
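
A minimal sketch of the preprocessing I have in mind (standalone Python, not part of OpenBackdoor; the exact spelling of the token, @USER vs. @user, is an assumption):

```python
import re

# OffensEval anonymizes mentions as "@USER"; matching case-insensitively
# also catches lowercase variants (an assumption about the token spelling).
MENTION = re.compile(r"@user\b", re.IGNORECASE)

def strip_mentions(text: str) -> str:
    """Remove @user placeholders and collapse the leftover whitespace."""
    return re.sub(r"\s+", " ", MENTION.sub(" ", text)).strip()

print(strip_mentions("@USER @USER this is just offensive"))
# -> "this is just offensive"
```

Applying this to both the clean and the poisoned samples would at least rule out the mention tokens as a confound.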


Thanks for your suggestion. We have updated the toxic datasets with the pre-processed ones.

Thank you for your reply. How can I download these processed datasets? :)

The bash scripts for downloading datasets have been updated; you can obtain the processed datasets by running them.
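
If it helps, here is a minimal sketch of invoking one of the scripts programmatically; the script name below is hypothetical, so check the `datasets/` directory for the actual file names:

```python
import subprocess

# Hypothetical script name -- check the datasets/ directory of the repo
# for the actual file before running.
subprocess.run(["bash", "download_toxic.sh"], cwd="datasets", check=True)
```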