RUCAIBox/TextBox

Pre-training from scratch

nlpmc opened this issue · 8 comments

nlpmc commented

Thanks for open-sourcing this exciting tool!
When I tried to pre-train BART from scratch with TextBox, I found that the wudao corpus mentioned in the documentation is not provided. Where can I get this data?

python run_textbox.py --model=BART --dataset=wudao --pretrain_task=denoising

Since I did not have the wudao dataset, I tried to pre-train BART on the example dataset samsum using the denoising task.
However, I got the following error:

05 Jan 23:25    INFO ====== Start training ======
train    1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 77/77 [03:16<00:00,  2.55s/step, loss=2.78]
05 Jan 23:29    INFO Train epoch  1 [time: 196.60s, loss: 2.78]
generating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [01:22<00:00,  1.58s/it]
05 Jan 23:30    ERROR Traceback (most recent call last):
  File "/mnt/windata/projects/TextBox/textbox/utils/dashboard.py", line 312, in new_experiment
    yield True
  File "/mnt/windata/projects/TextBox/textbox/quick_start/experiment.py", line 136, in run
    self._do_train_and_valid()
  File "/mnt/windata/projects/TextBox/textbox/quick_start/experiment.py", line 111, in _do_train_and_valid
    self.valid_result = self.trainer.fit(train_data, valid_data)
  File "/mnt/windata/projects/TextBox/textbox/trainer/trainer.py", line 455, in fit
    self.stopped |= self._valid(valid_data, 'epoch')
  File "/home/xxx/anaconda3/envs/TextBox/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/windata/projects/TextBox/textbox/trainer/trainer.py", line 294, in _valid
    valid_results = self.evaluate(valid_data, is_valid=True)
  File "/home/xxx/anaconda3/envs/TextBox/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/windata/projects/TextBox/textbox/trainer/trainer.py", line 548, in evaluate
    corpus_len = len(eval_data.dataset.target_text)
AttributeError: 'AbstractDataset' object has no attribute 'target_text'
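The traceback above ends in an unconditional read of `target_text`. As a minimal sketch of how such an access could be guarded (this is not TextBox's actual fix; `ToyDataset` and `corpus_length` are hypothetical names, and falling back to the source length is an assumption):

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset; not a TextBox class."""

    def __init__(self, source_text, target_text=None):
        self.source_text = source_text
        # Only define target_text when references exist, which mimics a
        # pre-training dataset that carries plain text only.
        if target_text is not None:
            self.target_text = target_text


def corpus_length(dataset):
    """Return the number of examples without assuming target_text exists."""
    # getattr with a default avoids the AttributeError seen in the traceback.
    target_text = getattr(dataset, "target_text", None)
    if target_text is not None:
        return len(target_text)
    # Assumption: fall back to the number of source lines.
    return len(dataset.source_text)
```

With this guard, a source-only dataset yields a length instead of raising.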

Could you please help me find what I am doing wrong? Thanks!

nlpmc commented

If wudao cannot be published, could you share its file schema so that I can build a similar dataset file from my own data?

The dataset can be found at the link: https://resource.wudaoai.cn/home.

Also, the run command needs the additional flag --do_test=False. Thanks for reporting!

nlpmc commented

Thanks for the quick reply!

Do you mean --do_valid=False? I tried --do_test=False and the same error was raised. Disabling validation removes the validation step during training, so how can I monitor training progress (e.g. PPL)? I also found that adding this flag stops the trainer from saving a checkpoint every epoch.
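On the PPL question: assuming the loss printed in the training log is a mean per-token cross-entropy in nats (an assumption; this thread does not confirm what TextBox logs), perplexity is just its exponential. A minimal sketch, with `perplexity` as a hypothetical helper name:

```python
import math


def perplexity(mean_token_loss):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    # Assumption: the loss is averaged over tokens and uses the natural log.
    return math.exp(mean_token_loss)


# The epoch-1 training loss reported in the log above was 2.78.
train_ppl = perplexity(2.78)
print(f"approximate training perplexity: {train_ppl:.2f}")
```

Under that assumption, the logged loss of 2.78 corresponds to a training perplexity of roughly 16. This is computed on training data, so it is only a rough progress signal and not a substitute for validation.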

samsum is a summarization dataset that contains the document as src and the summary as tgt. However, pre-training a language model uses only plain text as input. With a plain-text corpus like wudao, how should the dataset file be constructed? Should the same text be used as both src and tgt?
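To make the question concrete, here is a minimal sketch of turning one line of plain text into a (src, tgt) pair for a denoising objective. It only illustrates the general idea of masking a span and reconstructing the original. Whether TextBox expects pre-corrupted input or applies the noise itself is not stated in this thread, and `MASK_TOKEN` and `make_denoising_pair` are hypothetical names:

```python
import random

MASK_TOKEN = "<mask>"  # assumption: placeholder symbol, not taken from TextBox


def make_denoising_pair(text, mask_ratio=0.3, rng=None):
    """Return (corrupted_source, original_target) for one line of plain text.

    One contiguous span of whitespace-separated tokens is replaced by a
    single mask token, loosely in the spirit of BART-style text infilling.
    """
    rng = rng or random.Random()
    tokens = text.split()
    if not tokens:
        return text, text
    # Clamp the span so it covers at least one token and at most all of them.
    span_len = min(len(tokens), max(1, int(len(tokens) * mask_ratio)))
    start = rng.randrange(0, len(tokens) - span_len + 1)
    corrupted = tokens[:start] + [MASK_TOKEN] + tokens[start + span_len:]
    # The target is the original text with whitespace normalized.
    return " ".join(corrupted), " ".join(tokens)
```

If the toolkit performs the corruption internally, the simpler option raised above (identical src and tgt) may be all that is needed; the updated pre-training instructions are the authority on that.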

First, please pull the latest version of the TextBox repository.
Then read the pre-training instructions, which we have updated to address your questions.
Thanks for raising them.

nlpmc commented

Nice instructions! Thanks for your help!

nlpmc commented

After pulling the latest version, I hit a new error when training on any generation task. It may have been introduced by commit 9765c73.

Traceback (most recent call last):
  File "run_textbox.py", line 12, in <module>
    run_textbox(model=args.model, dataset=args.dataset, config_file_list=args.config_files, config_dict={})
  File "/mnt/windata/projects/TextBox/textbox/quick_start/quick_start.py", line 20, in run_textbox
    experiment = Experiment(model, dataset, config_file_list, config_dict)
  File "/mnt/windata/projects/TextBox/textbox/quick_start/experiment.py", line 39, in __init__
    self.config = Config(model, dataset, config_file_list, config_dict)
  File "/mnt/windata/projects/TextBox/textbox/config/configurator.py", line 72, in __init__
    self._set_associated_parameters()
  File "/mnt/windata/projects/TextBox/textbox/config/configurator.py", line 317, in _set_associated_parameters
    if self.final_config_dict['pretrain_task']:
KeyError: 'pretrain_task'
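The `KeyError` above comes from indexing a config dictionary with a key that is absent when no pre-training task is set. A minimal sketch of a tolerant lookup follows; it is not the fix that was actually merged, and `is_pretraining` is a hypothetical helper name:

```python
def is_pretraining(final_config_dict):
    """Return True only when a pre-training task is configured."""
    # dict.get returns None for a missing key instead of raising KeyError,
    # so ordinary generation runs without 'pretrain_task' are unaffected.
    return bool(final_config_dict.get("pretrain_task"))
```

An alternative with the same effect would be to give `pretrain_task` a default value in the base configuration, so the key is always present.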

We have fixed that in the latest PR. Thanks for reporting!

nlpmc commented

Thanks for the quick fix! 👍