ThilinaRajapakse/simpletransformers

Problem with the absence of attention_mask when using sliding_window

DarijaNS opened this issue

Hello!

I am trying to fine-tune an Electra model with my own dataset, as described HERE, and I am using these model arguments:

import logging

import torch

from simpletransformers.language_modeling import (
    LanguageModelingArgs,
    LanguageModelingModel,
)

cuda_available = torch.cuda.is_available()

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

TRAIN_FILE = "train_set.txt"
VAL_FILE = "val_set.txt"

model_args = LanguageModelingArgs()

model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.num_train_epochs = 3
model_args.dataset_type = "simple"

# Long examples should be split into windows instead of being truncated
model_args.sliding_window = True
model_args.max_seq_length = 512
model_args.train_batch_size = 32
model_args.gradient_accumulation_steps = 32

model_args.config = {
    "embedding_size": 768,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
}

model_args.vocab_size = 32000

model_args.evaluate_during_training = True
model_args.evaluate_during_training_silent = False
model_args.evaluate_during_training_verbose = True
model_args.manual_seed = 42

model = LanguageModelingModel(
    model_type="electra",
    model_name="electra",
    discriminator_name="classla/bcms-bertic",
    generator_name="classla/bcms-bertic-generator",
    args=model_args,
    use_cuda=cuda_available,
)

model.train_model(TRAIN_FILE, eval_file=VAL_FILE)

When setting model_args.sliding_window = True, I always get this warning:

We strongly recommend passing in an attention_mask since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.

I took a closer look at the source code in language_modeling_model.py and noticed that an attention mask is only created when use_hf_datasets is enabled:

....
                inputs = inputs.to(self.device)
                attention_mask = (
                    batch["attention_mask"].to(self.device)
                    if self.args.use_hf_datasets
                    else None
                )
                token_type_ids = (
                    batch["token_type_ids"].to(self.device)
                    if self.args.use_hf_datasets and "token_type_ids" in batch
                    else None
                )
...

I assume that without sliding_window no padding is added, which is why the warning does not appear in that case.
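If I understand the sliding window correctly, the tokenized text is split into overlapping windows of max_seq_length tokens and the last, shorter window is padded up to that length. The sketch below is only my mental model of that step, not the actual library code, and the stride value is a guess:

def chunk_with_padding(token_ids, max_len=512, stride=410, pad_id=0):
    # Split a long token sequence into overlapping windows of max_len tokens,
    # padding the final (shorter) window up to max_len with pad_id.
    windows = []
    for start in range(0, len(token_ids), stride):
        window = token_ids[start : start + max_len]
        window = window + [pad_id] * (max_len - len(window))
        windows.append(window)
    return windows

windows = chunk_with_padding(list(range(1, 1001)))
print(len(windows))          # 3 windows for a 1000-token example
print(windows[-1].count(0))  # 332 pad tokens in the last window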

Did I understand that correctly? I also evaluated the model on a test set where no example exceeds max_seq_length, both with and without this parameter, and found drastic differences in the results in terms of eval loss and perplexity.
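If perplexity is computed as the exponential of the mean eval loss (which is my understanding), even a modest gap in loss turns into a large perplexity gap. The two loss values below are made up, just to illustrate the scale of the effect:

import math

eval_loss_without_sliding_window = 1.6  # hypothetical values, not my actual results
eval_loss_with_sliding_window = 2.1

print(math.exp(eval_loss_without_sliding_window))  # ~4.95
print(math.exp(eval_loss_with_sliding_window))     # ~8.17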

So my question is: is there a way to pass the attention_mask, which is generally important for language model training, or does it have no influence on the quality of fine-tuning in this particular situation?
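For illustration, this is roughly the kind of mask I have in mind, derived from the padded input_ids. The pad id 0 and the toy batch are placeholders, not values from my actual tokenizer:

import torch

pad_token_id = 0  # placeholder; the real value comes from the tokenizer

input_ids = torch.tensor([[101, 2023, 2003, 102, 0, 0, 0, 0]])  # toy padded batch
attention_mask = (input_ids != pad_token_id).long()

print(attention_mask)  # tensor([[1, 1, 1, 1, 0, 0, 0, 0]])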

Thank you in advance!
Darija