thunlp/OpenBackdoor

Unclear definition for poisoner_data_path

acphile opened this issue · 2 comments

Currently, the poisoner has two path parameters:

poison_data_basepath (:obj:`str`, optional): the path to the poisoned data. Default to `None`.
poisoned_data_path (:obj:`str`, optional): the path to save the poisoned data. Default to `None`.

According to the docstring, poison_data_basepath is for loading and poisoned_data_path is for saving.

However, in the following code, both poison_data_basepath and poisoned_data_path are used for saving, which could lead to confusion.

else:
    poison_train_data = self.poison(data["train"])
    self.save_data(data["train"], self.poison_data_basepath, "train-clean")
    self.save_data(poison_train_data, self.poison_data_basepath, "train-poison")
    poisoned_data["train"] = self.poison_part(data["train"], poison_train_data)
    self.save_data(poisoned_data["train"], self.poisoned_data_path, "train-poison")

I suggest merging these two parameters into one.

Hi, thank you for your feedback!
We use two separate parameters to distinguish between the path of a fully poisoned dataset and that of a partially poisoned dataset. To improve reusability, we first poison the entire clean dataset and save the result to poison_data_basepath. This fully poisoned dataset can then be reused to produce different partially poisoned datasets with different poison_setting and poison_rate values, which are saved to poisoned_data_path.
However, we will consider merging them if they lead to confusion.
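
To make the two-stage workflow concrete, here is a minimal, self-contained sketch. It is not OpenBackdoor's actual implementation: the functions `poison`, `save_data`, `load_data`, and `poison_part` below are simplified stand-ins written for illustration, and the trigger token, CSV file layout, and `rate-...` subdirectory names are assumptions. The idea it shows is that the fully poisoned copy is written once to a base path and then reused to build several partially poisoned datasets at different poison rates.

```python
import csv
import os
import random
import tempfile


def poison(sample):
    # Hypothetical trigger: append a fixed token and set the label to target class 1.
    text, _label = sample
    return (text + " cf", 1)


def save_data(rows, path, name):
    # Write (text, label) rows to <path>/<name>.csv, creating the directory if needed.
    os.makedirs(path, exist_ok=True)
    out = os.path.join(path, f"{name}.csv")
    with open(out, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return out


def load_data(path, name):
    # Read rows back as (text, int label) tuples.
    with open(os.path.join(path, f"{name}.csv"), newline="", encoding="utf-8") as f:
        return [(text, int(label)) for text, label in csv.reader(f)]


def poison_part(clean, fully_poisoned, poison_rate, seed=0):
    # Simplified mixing: replace a random fraction of clean samples
    # with their already-poisoned counterparts.
    rng = random.Random(seed)
    n_poison = int(len(clean) * poison_rate)
    chosen = set(rng.sample(range(len(clean)), n_poison))
    return [fully_poisoned[i] if i in chosen else clean[i] for i in range(len(clean))]


clean = [(f"sample {i}", 0) for i in range(10)]
poison_data_basepath = tempfile.mkdtemp()

# Stage 1: poison the whole dataset once and cache it under the base path.
save_data(clean, poison_data_basepath, "train-clean")
save_data([poison(s) for s in clean], poison_data_basepath, "train-poison")

# Stage 2: reuse the cached copy for several poison rates,
# saving each partially poisoned dataset to its own path.
fully = load_data(poison_data_basepath, "train-poison")
for rate in (0.1, 0.5):
    poisoned_data_path = os.path.join(poison_data_basepath, f"rate-{rate}")
    save_data(poison_part(clean, fully, rate), poisoned_data_path, "train-poison")
```

In this sketch the expensive step (running the trigger over every sample) happens only in stage 1, while stage 2 is a cheap selection step that can be repeated for each setting.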

I understand. Thanks!