RUCAIBox/TextBox

[🐛BUG] UnicodeDecodeError

Closed this issue · 3 comments

Describe the bug
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 1105: illegal multibyte sequence

How to reproduce
C:\Users\dell>python ./TextBox/run_textbox.py --model=BART --dataset=samsum --model_path=facebook/bart-base

Log
06 Oct 20:08 INFO 66 parameters found.

General Hyper Parameters:

gpu_id: 0
use_gpu: True
device: cpu
seed: 2020
reproducibility: True
cmd: ./TextBox/run_textbox.py --model=BART --dataset=samsum --model_path=facebook/bart-base
filename: BART-samsum-2023-Oct-06_20-08-24
saved_dir: saved/
state: INFO
wandb: online

Training Hyper Parameters:

do_train: True
do_valid: True
optimizer: adamw
adafactor_kwargs: {'lr': 0.001, 'scale_parameter': False, 'relative_step': False, 'warmup_init': False}
optimizer_kwargs: {}
valid_steps: 1
valid_strategy: epoch
stopping_steps: 2
epochs: 50
learning_rate: 3e-05
train_batch_size: 4
grad_clip: 0.1
accumulation_steps: 48
disable_tqdm: False
resume_training: True

Evaluation Hyper Parameters:

do_test: True
lower_evaluation: True
multiref_strategy: max
bleu_max_ngrams: 4
bleu_type: nltk
smoothing_function: 0
corpus_bleu: False
rouge_max_ngrams: 2
rouge_type: files2rouge
meteor_type: pycocoevalcap
chrf_type: m-popovic
distinct_max_ngrams: 4
inter_distinct: True
unique_max_ngrams: 4
self_bleu_max_ngrams: 4
tgt_lang: en
metrics: ['rouge']
eval_batch_size: 16
corpus_meteor: True

Model Hyper Parameters:

model: BART
model_name: bart
model_path: facebook/bart-base
config_kwargs: {}
tokenizer_kwargs: {'use_fast': True}
generation_kwargs: {'num_beams': 5, 'no_repeat_ngram_size': 3, 'early_stopping': True}
efficient_kwargs: {}
efficient_methods: []
efficient_unfreeze_model: False
label_smoothing: 0.1

Dataset Hyper Parameters:

dataset: samsum
data_path: dataset/samsum
tgt_lang: en
src_len: 1024
tgt_len: 128
truncate: tail
metrics_for_best_model: ['rouge-1', 'rouge-2', 'rouge-l']
prefix_prompt: Summarize:

Unrecognized Hyper Parameters:

find_unused_parameters: False
load_type: from_pretrained
tokenizer_add_tokens: []

================================================================================
06 Oct 20:08 INFO Pretrain type: pretrain disabled
Traceback (most recent call last):
  File "C:\Users\dell\TextBox\run_textbox.py", line 12, in <module>
    run_textbox(model=args.model, dataset=args.dataset, config_file_list=args.config_files, config_dict={'model_path': 'facebook/bart-base'})
  File "C:\Users\dell\TextBox\textbox\quick_start\quick_start.py", line 20, in run_textbox
    experiment = Experiment(model, dataset, config_file_list, config_dict)
  File "C:\Users\dell\TextBox\textbox\quick_start\experiment.py", line 56, in __init__
    self._init_data(self.get_config(), self.accelerator)
  File "C:\Users\dell\TextBox\textbox\quick_start\experiment.py", line 82, in _init_data
    train_data, valid_data, test_data = data_preparation(config, tokenizer)
  File "C:\Users\dell\TextBox\textbox\data\utils.py", line 23, in data_preparation
    train_dataset = AbstractDataset(config, 'train')
  File "C:\Users\dell\TextBox\textbox\data\abstract_dataset.py", line 25, in __init__
    self.source_text = load_data(source_filename, max_length=self.quick_test)
  File "C:\Users\dell\TextBox\textbox\data\misc.py", line 25, in load_data
    for line in fin:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 1105: illegal multibyte sequence

This happens because the file is read on Windows, whose default codec (here 'gbk') cannot decode the dataset. We strongly recommend running TextBox on Ubuntu; we have not tested it on Windows.

As a temporary workaround, you can modify the code here (https://github.com/RUCAIBox/TextBox/blob/2.0.0/textbox/data/misc.py#L22). See also:
https://blog.csdn.net/ProgramNovice/article/details/126712944
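The workaround amounts to passing an explicit `encoding` to `open()` so Python does not fall back to the Windows locale codec. A minimal sketch, assuming `load_data` in `misc.py` reads the dataset line by line (the exact signature and body in TextBox may differ; this version is illustrative):

```python
import os
import tempfile

def load_data(path, max_length=None):
    """Read one example per line, forcing UTF-8 so that Windows
    does not decode the file with its locale codec (e.g. 'gbk')."""
    lines = []
    # encoding='utf-8' is the key change; without it, open() uses
    # locale.getpreferredencoding(), which is 'gbk' on Chinese Windows.
    with open(path, "r", encoding="utf-8") as fin:
        for line in fin:
            lines.append(line.rstrip("\n"))
            if max_length and len(lines) >= max_length:
                break
    return lines

# Demo: a UTF-8 file containing a byte sequence that is invalid GBK.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".src", delete=False
) as f:
    f.write("Summarize: café ¦ naïve\n")
    tmp_path = f.name

print(load_data(tmp_path))
os.remove(tmp_path)
```

If you prefer not to patch the code, setting the environment variable `PYTHONUTF8=1` (or running Python with `-X utf8`) forces UTF-8 as the default encoding process-wide.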

If I use a dataset that I downloaded myself, which file's code should I modify so that the local dataset is used?