Pre-training from scratch
nlpmc opened this issue · 8 comments
Thanks for open-sourcing this exciting tool!
When I used TextBox to pre-train BART from scratch, I found that the wudao corpus mentioned in the documentation has not been provided. Where can I get this data?
python run_textbox.py --model=BART --dataset=wudao --pretrain_task=denoising
Since I did not have the wudao dataset, I tried to use the example dataset samsum to pre-train BART with the denoising task.
However, I got the following error:
05 Jan 23:25 INFO ====== Start training ======
train 1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 77/77 [03:16<00:00, 2.55s/step, loss=2.78]
05 Jan 23:29 INFO Train epoch 1 [time: 196.60s, loss: 2.78]
generating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [01:22<00:00, 1.58s/it]
05 Jan 23:30 ERROR Traceback (most recent call last):
File "/mnt/windata/projects/TextBox/textbox/utils/dashboard.py", line 312, in new_experiment
yield True
File "/mnt/windata/projects/TextBox/textbox/quick_start/experiment.py", line 136, in run
self._do_train_and_valid()
File "/mnt/windata/projects/TextBox/textbox/quick_start/experiment.py", line 111, in _do_train_and_valid
self.valid_result = self.trainer.fit(train_data, valid_data)
File "/mnt/windata/projects/TextBox/textbox/trainer/trainer.py", line 455, in fit
self.stopped |= self._valid(valid_data, 'epoch')
File "/home/xxx/anaconda3/envs/TextBox/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/mnt/windata/projects/TextBox/textbox/trainer/trainer.py", line 294, in _valid
valid_results = self.evaluate(valid_data, is_valid=True)
File "/home/xxx/anaconda3/envs/TextBox/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/mnt/windata/projects/TextBox/textbox/trainer/trainer.py", line 548, in evaluate
corpus_len = len(eval_data.dataset.target_text)
AttributeError: 'AbstractDataset' object has no attribute 'target_text'
Could you please help me find what I am doing wrong with TextBox? Thanks!
If wudao cannot be published, could you share its file schema so that I can build a similar dataset file from my own data?
The dataset can be found at the link: https://resource.wudaoai.cn/home.
The run script also needs --do_test=False added to it. Thanks for reporting this!
Thanks for the quick reply!
Do you mean --do_valid=False? I have tried do_test and the same error was raised. This option removes validation during training, so how can I track the training progress (e.g. PPL)? I also found that adding this option stops the trainer from saving a checkpoint every epoch.
samsum is a summarization dataset that contains the document as src and the summary as tgt. However, pre-training a language model only uses plain text as input. When we use a plain-text corpus like wudao, how should the dataset file be constructed? By using the same text as both src and tgt?
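The idea raised in this question, writing the same plain text to both the src and tgt side, could be sketched as follows. This is only an illustration: the helper name `write_plain_text_split` and the `<split>.src` / `<split>.tgt` file naming with one example per line are assumptions, and the thread does not confirm that this is the layout TextBox expects for pre-training. The updated pre-training instructions in the repository are the authority.

```python
from pathlib import Path
import tempfile


def write_plain_text_split(lines, out_dir, split):
    """Write identical text to <split>.src and <split>.tgt, one example per line.

    Hypothetical helper: the file naming is an assumption, not a confirmed
    TextBox requirement for pre-training corpora.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Drop blank lines and collapse internal whitespace so that every
    # example occupies exactly one line in the output files.
    cleaned = [" ".join(line.split()) for line in lines if line.strip()]
    text = "".join(example + "\n" for example in cleaned)
    for suffix in ("src", "tgt"):
        (out_dir / f"{split}.{suffix}").write_text(text, encoding="utf-8")
    return len(cleaned)


if __name__ == "__main__":
    corpus = ["First plain-text document.", "", "Second   plain-text document."]
    with tempfile.TemporaryDirectory() as tmp:
        count = write_plain_text_split(corpus, tmp, "train")
        print(f"wrote {count} examples to {tmp}")
```

With a denoising objective the corruption of the input is normally applied by the training code, which is why the sketch leaves both sides identical.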
First, please pull the latest version of the TextBox repository.
Then, you can read the pre-training instructions. We have updated them to address your concerns.
Thanks for your questions.
Nice instructions! Thanks for your help!
After pulling the latest version, I found a new error when training on any generation task. This issue may be caused by commit 9765c73.
Traceback (most recent call last):
File "run_textbox.py", line 12, in <module>
run_textbox(model=args.model, dataset=args.dataset, config_file_list=args.config_files, config_dict={})
File "/mnt/windata/projects/TextBox/textbox/quick_start/quick_start.py", line 20, in run_textbox
experiment = Experiment(model, dataset, config_file_list, config_dict)
File "/mnt/windata/projects/TextBox/textbox/quick_start/experiment.py", line 39, in __init__
self.config = Config(model, dataset, config_file_list, config_dict)
File "/mnt/windata/projects/TextBox/textbox/config/configurator.py", line 72, in __init__
self._set_associated_parameters()
File "/mnt/windata/projects/TextBox/textbox/config/configurator.py", line 317, in _set_associated_parameters
if self.final_config_dict['pretrain_task']:
KeyError: 'pretrain_task'
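The traceback above shows a plain dictionary lookup failing because the key is absent when no pre-training task is configured. A minimal sketch of a tolerant lookup is given below. It is not the project's actual fix, which is not shown in this thread, and the function name `pretrain_task_enabled` is hypothetical.

```python
def pretrain_task_enabled(config):
    """Return True only if 'pretrain_task' is present and truthy.

    Hypothetical helper for illustration; not part of TextBox.
    """
    # dict.get returns None for a missing key instead of raising KeyError,
    # so ordinary (non-pre-training) runs fall through to False.
    return bool(config.get("pretrain_task"))


if __name__ == "__main__":
    finetune_config = {"model": "BART", "dataset": "samsum"}
    pretrain_config = {"model": "BART", "dataset": "samsum", "pretrain_task": "denoising"}
    print(pretrain_task_enabled(finetune_config))
    print(pretrain_task_enabled(pretrain_config))
```

An alternative with the same effect is to give the key a default value where the configuration is first assembled, so that later lookups can index it directly.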
We have fixed that in the latest PR. Thanks for reporting it!
Thanks for the quick fix! 👍