Error with work.sh on large set of unlabeled text
andreasvc opened this issue · 1 comment
I tried to run a trained model on a "large" set of book reviews (15 MB).
I prepared the file as if it were a test set, with all tokens labeled as "O".
I get the following error:
Load checkpoint ./bert-tfm-bookreviews-finetune/checkpoint-1200/pytorch_model.bin...
test class count: [0. 0. 0.]
***** Running prediction *****
Evaluating: 0%| | 0/69420 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "work.py", line 216, in <module>
    main()
  File "work.py", line 125, in main
    predict(args, model, tokenizer)
  File "work.py", line 161, in predict
    outputs = model(**inputs)
  File "/home/p286012/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/var/tmp/andreas/bookreviews-absa/BERT-E2E-ABSA/absa_layer.py", line 437, in forward
    attention_mask=attention_mask, head_mask=head_mask)
  File "/home/p286012/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/p286012/.local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 964, in forward
    past_key_values_length=past_key_values_length,
  File "/home/p286012/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/p286012/.local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 206, in forward
    embeddings += position_embeddings
RuntimeError: The size of tensor a (1304) must match the size of tensor b (512) at non-singleton dimension 1
sh work-unlabeled.sh > 398.30s user 17.21s system 101% cpu 6:50.79 total
512 happens to be BERT's position-embedding limit, so it looks like the input didn't get truncated correctly, even though I used the default maximum of 128 tokens per sentence.
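As a sanity check, this is roughly how one could count word-piece tokens per review in the prepared test file (the file name and the "####" sentence/tag separator are assumptions about the data layout; adjust as needed):

from transformers import BertTokenizer

# Rough sanity check: report lines whose word-piece length exceeds BERT's
# 512-position limit. The path and the "####" separator between sentence and
# tags are assumptions; adjust to the actual data layout.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
with open("./data/bookreviews-goodreads_rest/test.txt", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        sentence = line.split("####")[0]
        n = len(tokenizer.tokenize(sentence))
        if n > 510:  # 512 minus [CLS] and [SEP]
            print(f"line {i}: {n} word-piece tokens")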
work-unlabeled.sh is basically the default, except that I use cased BERT:
#!/usr/bin/env bash
TASK_NAME="bookreviews-goodreads_rest"
ABSA_HOME="./bert-tfm-bookreviews-finetune"
CUDA_VISIBLE_DEVICES=0 python work.py --absa_home ${ABSA_HOME} \
--ckpt ${ABSA_HOME}/checkpoint-1200 \
--model_type bert \
--data_dir ./data/${TASK_NAME} \
--task_name ${TASK_NAME} \
--model_name_or_path bert-base-cased \
--cache_dir ./cache \
--max_seq_length 128 \
--tagging_schema BIEOS
Similarly for train.sh:
#!/usr/bin/env bash
TASK_NAME=bookreviews
ABSA_TYPE=tfm
CUDA_VISIBLE_DEVICES=0,2,3 python main.py --model_type bert \
--absa_type ${ABSA_TYPE} \
--tfm_mode finetune \
--fix_tfm 0 \
--model_name_or_path bert-base-cased \
--data_dir ./data/${TASK_NAME} \
--task_name ${TASK_NAME} \
--per_gpu_train_batch_size 16 \
--per_gpu_eval_batch_size 8 \
--learning_rate 2e-5 \
--do_train \
--do_eval \
--tagging_schema BIEOS \
--overfit 0 \
--overwrite_output_dir \
--eval_all_checkpoints \
--MASTER_ADDR localhost \
--MASTER_PORT 28512 \
--max_steps 1500
Thank you for pointing out this problem.
As you can see in the function convert_examples_to_seq_features, although we keep the parameter max_seq_length, we do not use this pre-set value for truncation; instead, we set max_seq_length to the length of the longest sequence, because the sentences in the SemEval ABSA datasets are generally short.
So, regarding your problem, you should add a few lines to convert_examples_to_seq_features to truncate the sequences according to the value of max_seq_length. You can also check lines 232-235, which contain the original code for sequence truncation.
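For illustration, a minimal sketch of such a truncation; the names tokens_a and labels_a are assumptions, so adapt them to the actual variables used inside convert_examples_to_seq_features:

# Sketch only: cap each tokenized example at the pre-set max_seq_length
# before [CLS]/[SEP] are added, so the position embeddings never exceed 512.
# tokens_a / labels_a are assumed names for the word-piece tokens and their
# aligned tags; adjust to the real variables in the function.
max_tokens = max_seq_length - 2  # leave room for [CLS] and [SEP]
if len(tokens_a) > max_tokens:
    tokens_a = tokens_a[:max_tokens]
    labels_a = labels_a[:max_tokens]

Truncating the tag sequence together with the tokens keeps the labels aligned with the inputs, which matters for the sequence-tagging loss.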