tau-nlp/scrolls

Using a custom dataset for scrolls

I want to fine-tune the models on a custom dataset. The dataset is identical to the gov_report dataset except that the input is filtered by a content selection algorithm, so each input in the custom dataset is only part of the original input. Since there are commands that need to be run to prepare the dataset, what steps should I follow to run scrolls with the custom dataset? The dataset can be found here: https://huggingface.co/datasets/learn3r/gov_report_oreo

Hi @Leonard907,
For the provided commands, each dataset has a config file in https://github.com/tau-nlp/scrolls/tree/main/baselines/configs/datasets.

Specifically for the custom gov_report dataset, check out baselines/configs/datasets/gov_report.json.
Just change these settings (which are used by the datasets.load_dataset function):

    "dataset_name": "tau/scrolls",
    "dataset_config_name": "gov_report",

to

    "dataset_name": "learn3r/gov_report_oreo",
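
The same change can be sketched in Python. This is only an illustration, not code from the scrolls repository: the helper names retarget_dataset and load_dataset_args are hypothetical, the config is reduced to the two keys shown above, and dropping "dataset_config_name" assumes the custom dataset exposes a single default configuration.

```python
import json


def retarget_dataset(config, dataset_name, dataset_config_name=None):
    """Return a copy of a dataset config that points at another Hub dataset.

    Hypothetical helper for illustration; all other keys are left untouched.
    """
    updated = dict(config)
    updated["dataset_name"] = dataset_name
    if dataset_config_name is None:
        # Assumption: the custom dataset has one default configuration,
        # so no config name is needed.
        updated.pop("dataset_config_name", None)
    else:
        updated["dataset_config_name"] = dataset_config_name
    return updated


def load_dataset_args(config):
    """Build the positional arguments one would pass to datasets.load_dataset."""
    args = [config["dataset_name"]]
    if config.get("dataset_config_name"):
        args.append(config["dataset_config_name"])
    return tuple(args)


# Only the two keys discussed above; the real gov_report.json may hold more.
original = {"dataset_name": "tau/scrolls", "dataset_config_name": "gov_report"}
custom = retarget_dataset(original, "learn3r/gov_report_oreo")

print(json.dumps(custom, indent=4))
print(load_dataset_args(original))
print(load_dataset_args(custom))
```

With the datasets library installed, the resulting tuple could be unpacked into datasets.load_dataset(*load_dataset_args(custom)) to check that the custom dataset loads before starting a training run.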

Let me know if you have any more questions.

I tried and it works. Thank you very much!