Issue of max_seq_length in MLM pretraining data preprocessing
Hi,
I find that in the functions segment_pair_nsp_process and doc_sentences_process in examples/transformers/language-modeling/dataset_processing.py, the sequence length of the processed data is actually max_seq_length - tokenizer.num_special_tokens_to_add(pair=False), because the variable max_seq_length is replaced by this reduced value before it is passed to tokenizer.prepare_for_model.
For example, if the user sets max_seq_length=128, the processed data will have a sequence length of 125.
I'm not sure whether this is the standard way of preprocessing pretraining data?
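To make the pattern concrete, here is a minimal sketch (this is not the actual code from dataset_processing.py; the tokenizer, the example sentences, and the pair=True argument are only for illustration, chosen so the numbers match the 125 I observe):

```python
# Minimal sketch of the pattern described above (illustrative only, not the
# code from dataset_processing.py).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

max_seq_length = 128
# The preprocessing first reserves room for the special tokens ...
max_seq_length -= tokenizer.num_special_tokens_to_add(pair=True)  # 128 - 3 = 125

ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("First sentence of the pair."))
ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Second sentence of the pair."))

# ... and then passes the *reduced* value to prepare_for_model as max_length,
# so both truncation and padding target 125 tokens instead of 128.
encoded = tokenizer.prepare_for_model(
    ids_a,
    ids_b,
    truncation=True,
    max_length=max_seq_length,
    padding="max_length",
)
print(len(encoded["input_ids"]))  # 125
```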
Hi,
The maximum sequence length refers to the maximum number of tokens per input that will be passed to the model, which also includes the special tokens. In BERT's case, when passing a sentence pair you need 3 special tokens in addition to the tokenized text: a [CLS] token and 2 [SEP] tokens, to separate the two sentences and to mark the end of the input.
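For example, you can check this directly with a BERT-style tokenizer from HuggingFace transformers (bert-base-uncased is used here only for illustration):

```python
# Quick check of how many special tokens a BERT-style tokenizer adds on top of
# the raw text (bert-base-uncased is used here only for illustration).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.num_special_tokens_to_add(pair=False))  # 2 -> [CLS] text [SEP]
print(tokenizer.num_special_tokens_to_add(pair=True))   # 3 -> [CLS] text_a [SEP] text_b [SEP]
```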
Yes, this is consistent with my understanding and with what I observe in the tokenized input data. Sorry that I did not state the issue clearly. What I mean is that even with the [CLS] token and the 2 [SEP] tokens already added to the tokenized text, the input that is passed to the model has a length of 125 instead of the 128 set by max_seq_length.
I am not sure I understand your question. Please clarify it or add an example.
I use the command below to pretrain an MLM, with max_seq_length set to 128 and pad_to_max_length set to True, but I find that the shape of input_ids, i.e. the tokenized input to the model, is 32x125 instead of 32x128.
python run_mlm.py --config_name google/mobilebert-uncased --tokenizer_name google/mobilebert-uncased --datasets_name_config wikipedia:20200501.en bookcorpusopen --do_train --data_process_type segment_pair_nsp --max_seq_length 128 --per_device_train_batch_size 32 --per_device_eval_batch_size 32 --gradient_accumulation_steps 32 --save_total_limit 3 --num_train_epochs 5 --pad_to_max_length --dataset_cache_dir preprocessed_datasets_wikipedia_bookcorpusopen_128 --output_dir test --overwrite_output_dir --preprocessing_num_workers 48
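For comparison, here is a minimal snippet (again illustrative only, not the repository code or a patch to it) showing the behaviour I expected: passing the full max_seq_length to prepare_for_model keeps the padded length at 128, since prepare_for_model already reserves room for the special tokens when it truncates.

```python
# Sketch of the expected behaviour (illustrative only, not a patch to
# dataset_processing.py).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")

max_seq_length = 128
ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("First sentence of the pair."))
ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Second sentence of the pair."))

encoded = tokenizer.prepare_for_model(
    ids_a,
    ids_b,
    truncation=True,
    max_length=max_seq_length,  # full 128, not 128 - 3
    padding="max_length",
)
print(len(encoded["input_ids"]))  # 128
```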