Pre-training from scratch

Question

nlpmc opened this issue 2 years ago · 8 comments

Thanks for open-sourcing this exciting tool!
When I used TextBox to pre-train BART from scratch, I found that the wudao corpus mentioned in the documentation is not provided. Where can I get this data?

python run_textbox.py --model=BART --dataset=wudao --pretrain_task=denoising

Since I did not have the wudao dataset, I tried to use the example dataset samsum to pre-train BART with the denoising task.
However, I got the following error:

05 Jan 23:25    INFO ====== Start training ======
train    1: 100%|██████████| 77/77 [03:16<00:00,  2.55s/step, loss=2.78]
05 Jan 23:29    INFO Train epoch  1 [time: 196.60s, loss: 2.78]
generating: 100%|██████████| 52/52 [01:22<00:00,  1.58s/it]
05 Jan 23:30    ERROR Traceback (most recent call last):
  File "/mnt/windata/projects/TextBox/textbox/utils/dashboard.py", line 312, in new_experiment
    yield True
  File "/mnt/windata/projects/TextBox/textbox/quick_start/experiment.py", line 136, in run
    self._do_train_and_valid()
  File "/mnt/windata/projects/TextBox/textbox/quick_start/experiment.py", line 111, in _do_train_and_valid
    self.valid_result = self.trainer.fit(train_data, valid_data)
  File "/mnt/windata/projects/TextBox/textbox/trainer/trainer.py", line 455, in fit
    self.stopped |= self._valid(valid_data, 'epoch')
  File "/home/xxx/anaconda3/envs/TextBox/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/windata/projects/TextBox/textbox/trainer/trainer.py", line 294, in _valid
    valid_results = self.evaluate(valid_data, is_valid=True)
  File "/home/xxx/anaconda3/envs/TextBox/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/windata/projects/TextBox/textbox/trainer/trainer.py", line 548, in evaluate
    corpus_len = len(eval_data.dataset.target_text)
AttributeError: 'AbstractDataset' object has no attribute 'target_text'

Could you please help me find what I am doing wrong with TextBox? Thanks!

Answer 1 · 2023-01-05T15:41:12.000Z

If wudao cannot be published, could you share its file schema so that I can build a similar dataset file from my own data?

Answer 2 · 2023-01-05T15:51:17.000Z

The dataset can be found at the link: https://resource.wudaoai.cn/home.

Also, you need to add --do_test=False to the run command. Thanks for reporting this!

Answer 3 · 2023-01-06T01:15:08.000Z

Thanks for the quick reply!

Do you mean --do_valid=False? I tried --do_test=False and the same error was raised. This option disables validation during training, so how can I track training progress (e.g. PPL)? I also found that adding this option stops the trainer from saving a checkpoint every epoch.
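On the progress question: if the logged training loss is the mean per-token cross-entropy in nats (an assumption; this thread does not confirm how TextBox computes it), perplexity can be read off the existing log lines without running validation. A minimal sketch, where `perplexity_from_loss` is a hypothetical helper and not a TextBox function:

```python
import math


def perplexity_from_loss(mean_token_loss: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_token_loss)


# Loss value taken from the "Train epoch  1" log line in the original report.
ppl = perplexity_from_loss(2.78)
print(f"perplexity: {ppl:.2f}")
```

If the loss is averaged per sequence rather than per token, this number is not a true perplexity and should only be compared across epochs of the same run.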

samsum is a summarization dataset that contains the document as src and the summary as tgt. However, pre-training a language model uses only plain text as input. With a plain-text corpus like wudao, how should the dataset file be constructed? Should the same text be used as both src and tgt?
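The layout being asked about can be sketched as follows. This only illustrates the question (the same text on both sides) and is not confirmed by the maintainers in this thread. The directory name `my_corpus`, the file names `train.src` / `train.tgt`, and the helper `write_parallel_files` are assumptions modeled on the src/tgt convention described for samsum.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Union


def write_parallel_files(lines: Iterable[str], out_dir: Union[str, Path], split: str = "train") -> int:
    """Write each non-empty line to both <split>.src and <split>.tgt.

    Assumption: one example per line, with identical source and target text.
    Returns the number of examples written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(out / f"{split}.src", "w", encoding="utf-8") as src, \
            open(out / f"{split}.tgt", "w", encoding="utf-8") as tgt:
        for line in lines:
            # Collapse internal whitespace so each example stays on one line.
            text = " ".join(line.split())
            if not text:
                continue  # skip blank lines
            src.write(text + "\n")
            tgt.write(text + "\n")
            count += 1
    return count


corpus = ["First plain-text document.", "", "Second   plain-text\tdocument."]
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "my_corpus"  # hypothetical dataset directory
    n_written = write_parallel_files(corpus, data_dir)
    src_text = (data_dir / "train.src").read_text(encoding="utf-8")
    tgt_text = (data_dir / "train.tgt").read_text(encoding="utf-8")

print(n_written)
```

Whether a denoising objective needs the tgt file at all, or derives the corrupted input from a single text file, depends on the pre-training instructions in the repository.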

Answer 4 · 2023-01-06T02:42:06.000Z

First, pull the latest version of the TextBox repository.
Then read the pre-training instructions, which we have updated to address your concerns.
Thanks for your questions.

Answer 5 · 2023-01-06T03:07:18.000Z

Nice instructions! Thanks for your help!

Answer 6 · 2023-01-06T10:48:17.000Z

After pulling the latest version, I get a new error when training on any generation task. It may have been introduced by 9765c73.

Traceback (most recent call last):
  File "run_textbox.py", line 12, in <module>
    run_textbox(model=args.model, dataset=args.dataset, config_file_list=args.config_files, config_dict={})
  File "/mnt/windata/projects/TextBox/textbox/quick_start/quick_start.py", line 20, in run_textbox
    experiment = Experiment(model, dataset, config_file_list, config_dict)
  File "/mnt/windata/projects/TextBox/textbox/quick_start/experiment.py", line 39, in __init__
    self.config = Config(model, dataset, config_file_list, config_dict)
  File "/mnt/windata/projects/TextBox/textbox/config/configurator.py", line 72, in __init__
    self._set_associated_parameters()
  File "/mnt/windata/projects/TextBox/textbox/config/configurator.py", line 317, in _set_associated_parameters
    if self.final_config_dict['pretrain_task']:
KeyError: 'pretrain_task'
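The last frame is a plain dictionary lookup on a key that is apparently absent when no pre-training task is configured. A minimal sketch of that failure mode and a tolerant lookup, using an ordinary dict as a stand-in for `final_config_dict`; the dict contents are assumed, and this is an illustration rather than the fix applied in TextBox:

```python
# Stand-in for a config built for an ordinary generation run,
# where no pretrain_task option was supplied (assumed contents).
final_config_dict = {"model": "BART", "dataset": "samsum"}

missing_key = None
try:
    # Same lookup style as in the traceback: raises if the key is absent.
    if final_config_dict["pretrain_task"]:
        pass
except KeyError as err:
    missing_key = err.args[0]
    print(f"missing key: {missing_key}")

# dict.get returns None for an absent key, so the check evaluates to False
# instead of raising.
is_pretraining = bool(final_config_dict.get("pretrain_task"))
print(is_pretraining)
```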

Answer 7 · 2023-01-06T11:06:52.000Z

We have fixed that in the latest PR. Thanks for reporting it!

Answer 8 · 2023-01-06T11:14:27.000Z

Thanks for the quick fix! 👍