RUCAIBox/TextBox

[🐛BUG] I ran into a problem when using the mBART model with WMT19 zh-en.

Opened this issue · 2 comments

Describe the bug
I hit the error shown below when using the mBART model with the WMT19 zh-en dataset.

How to reproduce
run_textbox.py --model=mBART --model_path=facebook/mbart-large-cc25 --dataset=wmt19-zh-en --src_lang=zh_CN --tgt_lang=en_XX

Log
23 Apr 00:43 INFO Pretrain type: pretrain disabled
:1: SyntaxWarning: 'int' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'int' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'int' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'int' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'int' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'int' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'int' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'int' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'int' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'int' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: 'int' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: list indices must be integers or slices, not tuple; perhaps you missed a comma?
:1: SyntaxWarning: 'str' object is not callable; perhaps you missed a comma?
:1: SyntaxWarning: list indices must be integers or slices, not tuple; perhaps you missed a comma?
:1: SyntaxWarning: list indices must be integers or slices, not tuple; perhaps you missed a comma?
:1: SyntaxWarning: list indices must be integers or slices, not tuple; perhaps you missed a comma?
:1: SyntaxWarning: list indices must be integers or slices, not tuple; perhaps you missed a comma?
:1: SyntaxWarning: list indices must be integers or slices, not tuple; perhaps you missed a comma?
:1: SyntaxWarning: list indices must be integers or slices, not tuple; perhaps you missed a comma?
:1: SyntaxWarning: list indices must be integers or slices, not tuple; perhaps you missed a comma?
:1: SyntaxWarning: list indices must be integers or slices, not tuple; perhaps you missed a comma?
:1: SyntaxWarning: list indices must be integers or slices, not tuple; perhaps you missed a comma?
:1: SyntaxWarning: list indices must be integers or slices, not tuple; perhaps you missed a comma?
:1: SyntaxWarning: list indices must be integers or slices, not tuple; perhaps you missed a comma?
:1: SyntaxWarning: list indices must be integers or slices, not tuple; perhaps you missed a comma?
Token indices sequence length is longer than the specified maximum sequence length for this model (1776 > 1024). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
File "run_textbox.py", line 15, in <module>
run_textbox(model=args.model, dataset=args.dataset, config_file_list=args.config_files, config_dict={})
File "/hy-tmp/TextBox/textbox/quick_start/quick_start.py", line 20, in run_textbox
experiment = Experiment(model, dataset, config_file_list, config_dict)
File "/hy-tmp/TextBox/textbox/quick_start/experiment.py", line 56, in __init__
self._init_data(self.get_config(), self.accelerator)
File "/hy-tmp/TextBox/textbox/quick_start/experiment.py", line 82, in _init_data
train_data, valid_data, test_data = data_preparation(config, tokenizer)
File "/hy-tmp/TextBox/textbox/data/utils.py", line 24, in data_preparation
train_dataset.tokenize(tokenizer)
File "/hy-tmp/TextBox/textbox/data/abstract_dataset.py", line 120, in tokenize
ids = tokenizer(
File "/usr/local/miniconda3/envs/TextBox/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2538, in __call__
encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
File "/usr/local/miniconda3/envs/TextBox/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2624, in _call_one
return self.batch_encode_plus(
File "/usr/local/miniconda3/envs/TextBox/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2815, in batch_encode_plus
return self._batch_encode_plus(
File "/usr/local/miniconda3/envs/TextBox/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 428, in _batch_encode_plus
encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

My environment: transformers 4.28.1 and torch 2.0.0+cu117.

As a temporary workaround, you can comment out lines 27~34 of https://github.com/RUCAIBox/TextBox/blob/2.0.0/textbox/data/misc.py. We will fix this as soon as we can.

If you run into further problems, feel free to keep asking.
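
For anyone debugging the same `TypeError: TextEncodeInput must be Union[...]`: fast tokenizers commonly raise it when an element of the batch is not a plain string, for example `None`, a number, or a nested list. The following is a minimal sketch for locating such entries before they reach the tokenizer. The helper name `find_non_string_inputs` and the sample batch are hypothetical and not part of TextBox; it only assumes the inputs are available as a Python list.

```python
from typing import Any, List, Sequence, Tuple


def find_non_string_inputs(texts: Sequence[Any]) -> List[Tuple[int, str]]:
    """Return (index, type name) for every element that is not a plain str."""
    return [
        (i, type(t).__name__)
        for i, t in enumerate(texts)
        if not isinstance(t, str)
    ]


# Hypothetical sample batch; in practice this would be the list of source
# or target texts that is about to be passed to the tokenizer.
samples = ["今天天气很好。", None, 42, ["already", "split"], "Hello"]

bad = find_non_string_inputs(samples)
print(bad)  # [(1, 'NoneType'), (2, 'int'), (3, 'list')]
```

If the returned list is non-empty, the reported indices point at the dataset lines worth inspecting first.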