dataset
clearloveclearlove opened this issue
When I perform a style attack on the offenseval dataset, I find that the poisoned samples are cleanly formatted, while the original offenseval data contains a large number of @user tokens. This seems unreasonable: in that case, the trigger is most likely not the style of the text but the absence of @user tokens in the poisoned samples.
So I suggest preprocessing offenseval before the experiment, for example by filtering out the @user tokens.
Other spam and toxic datasets have similar problems.
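For illustration, here is a minimal preprocessing sketch in Python; the regex and the function name are my own, not code from the repository:

```python
import re

# Match Twitter-style mentions such as "@USER" (the anonymized placeholder
# used in offenseval) or "@someusername".
MENTION_RE = re.compile(r"@\w+")

def strip_mentions(text: str) -> str:
    """Remove @user tokens and collapse the leftover whitespace."""
    return re.sub(r"\s+", " ", MENTION_RE.sub("", text)).strip()

# After this step, clean and poisoned samples both lack mentions,
# so the style transfer is the only remaining difference.
print(strip_mentions("@USER @USER She should ask what their take on this is."))
```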
Thanks for your suggestion. We have updated the toxic datasets with pre-processed versions.
Thank you for your reply. How can I download these processed datasets? :)
The bash scripts for downloading the datasets have been updated, so you can access the processed datasets by running those scripts.