dataset
clearloveclearlove opened this issue
When I perform a style attack on the offenseval dataset, I find that the poisoned samples are cleanly formatted, while the original offenseval data contains a large number of @user tokens. This seems unreasonable: in that case, the trigger is most likely not the style of the text but the absence of @user tokens in the poisoned samples.
So I suggest preprocessing offenseval before the experiment, for example by filtering out the @user tokens.
Other spam and toxic datasets have similar problems.
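For illustration, here is a minimal preprocessing sketch in Python; the regex and the function name are my own, not code from the repository:

```python
import re

# Match Twitter-style mentions such as "@USER" (the anonymized placeholder
# used in offenseval) or "@someusername".
MENTION_RE = re.compile(r"@\w+")

def strip_mentions(text: str) -> str:
    """Remove @user tokens and collapse the leftover whitespace."""
    return re.sub(r"\s+", " ", MENTION_RE.sub("", text)).strip()

# After this step, clean and poisoned samples both lack mentions,
# so the style transfer is the only remaining difference.
print(strip_mentions("@USER @USER She should ask what their take on this is."))
```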
Thanks for your suggestion. We have updated the toxic datasets with pre-processed versions.
Thank you for your reply. How can I download these processed datasets? :)
The bash scripts for downloading the datasets have been updated, so you can access the processed datasets by running those scripts.