lixin4ever/BERT-E2E-ABSA

Error with work.sh on large set of unlabeled text

andreasvc opened this issue · 1 comment

I tried to run a trained model on a "large" set of book reviews (15 MB).
I prepared the file as if it were a test set, with all tokens labeled as "O".
I get the following error:

Load checkpoint ./bert-tfm-bookreviews-finetune/checkpoint-1200/pytorch_model.bin...
test class count: [0. 0. 0.]
***** Running prediction *****
Evaluating:   0%|                                                                                                                                                    | 0/69420 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "work.py", line 216, in <module>
    main()
  File "work.py", line 125, in main
    predict(args, model, tokenizer)
  File "work.py", line 161, in predict
    outputs = model(**inputs)
  File "/home/p286012/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/var/tmp/andreas/bookreviews-absa/BERT-E2E-ABSA/absa_layer.py", line 437, in forward
    attention_mask=attention_mask, head_mask=head_mask)
  File "/home/p286012/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/p286012/.local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 964, in forward
    past_key_values_length=past_key_values_length,
  File "/home/p286012/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/p286012/.local/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 206, in forward
    embeddings += position_embeddings
RuntimeError: The size of tensor a (1304) must match the size of tensor b (512) at non-singleton dimension 1
sh work-unlabeled.sh >   398.30s user 17.21s system 101% cpu 6:50.79 total

512 happens to be BERT's maximum sequence length, so maybe the input didn't get truncated correctly, even though I used the default maximum of 128 tokens per sentence.
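
As a sanity check (not part of the repo; the test file path and the word=O label format here are assumptions based on how I prepared the data), something like this could confirm whether any prepared sentence exceeds BERT's 512-token limit:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

max_subwords = 0
with open("./data/bookreviews-goodreads_rest/test.txt", encoding="utf-8") as f:
    for line in f:
        # strip the "=O" labels (and any "sentence####" prefix) to recover raw words
        labeled = line.split("####")[-1]
        words = [tok.rsplit("=", 1)[0] for tok in labeled.split()]
        max_subwords = max(max_subwords, len(tokenizer.tokenize(" ".join(words))))

print("longest sentence in subword tokens:", max_subwords)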

work-unlabeled.sh is basically the default work.sh; the only change is that I used cased BERT:

#!/usr/bin/env bash
TASK_NAME="bookreviews-goodreads_rest"
ABSA_HOME="./bert-tfm-bookreviews-finetune"
CUDA_VISIBLE_DEVICES=0 python work.py --absa_home ${ABSA_HOME} \
                      --ckpt ${ABSA_HOME}/checkpoint-1200 \
                      --model_type bert \
                      --data_dir ./data/${TASK_NAME} \
                      --task_name ${TASK_NAME} \
                      --model_name_or_path bert-base-cased \
                      --cache_dir ./cache \
                      --max_seq_length 128 \
                      --tagging_schema BIEOS

train.sh is similar:

#!/usr/bin/env bash
TASK_NAME=bookreviews
ABSA_TYPE=tfm
CUDA_VISIBLE_DEVICES=0,2,3 python main.py --model_type bert \
                         --absa_type ${ABSA_TYPE} \
                         --tfm_mode finetune \
                         --fix_tfm 0 \
                         --model_name_or_path bert-base-cased \
                         --data_dir ./data/${TASK_NAME} \
                         --task_name ${TASK_NAME} \
                         --per_gpu_train_batch_size 16 \
                         --per_gpu_eval_batch_size 8 \
                         --learning_rate 2e-5 \
                         --do_train \
                         --do_eval \
                         --tagging_schema BIEOS \
                         --overfit 0 \
                         --overwrite_output_dir \
                         --eval_all_checkpoints \
                         --MASTER_ADDR localhost \
                         --MASTER_PORT 28512 \
                         --max_steps 1500

Thank you for pointing out this problem.

As you can see in the function convert_examples_to_seq_features, although we keep the parameter max_seq_length, we do not use this preset value for truncation; instead, max_seq_length is set to the length of the longest sequence, because the sentences in the SemEval ABSA datasets are generally short.

So, regarding your problem, you should add a few lines to convert_examples_to_seq_features to truncate the sequences according to the value of max_seq_length. You can also check lines 232-235, which contain the original code for sequence truncation.
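
For illustration, a minimal sketch of the kind of truncation you could add (the variable and argument names are only illustrative, not the actual ones used in convert_examples_to_seq_features):

# cap each example before the [CLS]/[SEP] special tokens are added,
# mirroring what the original truncation code at lines 232-235 does
def truncate_example(words, tags, max_seq_length):
    limit = max_seq_length - 2  # reserve room for [CLS] and [SEP]
    if len(words) > limit:
        words = words[:limit]
        tags = tags[:limit]
    return words, tags

# e.g. words, tags = truncate_example(words, tags, args.max_seq_length)
# note: BERT's subword tokenization can expand a word into several pieces,
# so truncating the subword token list after tokenization is safer still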