Problem with the absence of attention_mask when using sliding_window
DarijaNS opened this issue · 0 comments
Hello!
I am trying to fine-tune an Electra model with my own dataset, as described HERE, and I am using these model arguments:
import logging

import torch

from simpletransformers.language_modeling import (
    LanguageModelingArgs,
    LanguageModelingModel,
)

cuda_available = torch.cuda.is_available()

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)
TRAIN_FILE = "train_set.txt"
VAL_FILE = "val_set.txt"
model_args = LanguageModelingArgs()
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.num_train_epochs = 3
model_args.dataset_type = "simple"
model_args.sliding_window = True
model_args.max_seq_length = 512
model_args.train_batch_size = 32
model_args.gradient_accumulation_steps = 32
model_args.config = {
    "embedding_size": 768,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
}
model_args.vocab_size = 32000
model_args.evaluate_during_training = True
model_args.evaluate_during_training_silent = False
model_args.evaluate_during_training_verbose = True
model_args.manual_seed = 42
model = LanguageModelingModel(
    model_type="electra",
    model_name="electra",
    discriminator_name="classla/bcms-bertic",
    generator_name="classla/bcms-bertic-generator",
    args=model_args,
    use_cuda=cuda_available,
)
model.train_model(TRAIN_FILE, eval_file=VAL_FILE)
When setting model_args.sliding_window = True, I always get this warning:

"We strongly recommend passing in an attention_mask since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked."
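To see what the warning refers to, I put together a minimal sketch outside of simpletransformers, calling the classla/bcms-bertic-generator checkpoint directly through transformers (this is only my own test, not part of the training loop): with padded input_ids and no attention_mask, the hidden states of the real tokens change, because the pad tokens are attended to.

import torch
from transformers import AutoModel, AutoTokenizer

# Compare hidden states of the real tokens with and without an attention_mask
# on a padded input (sketch only, independent of the simpletransformers code).
tokenizer = AutoTokenizer.from_pretrained("classla/bcms-bertic-generator")
model = AutoModel.from_pretrained("classla/bcms-bertic-generator")
model.eval()

enc = tokenizer(["kratka rečenica"], padding="max_length", max_length=16, return_tensors="pt")

with torch.no_grad():
    with_mask = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"]).last_hidden_state
    without_mask = model(input_ids=enc["input_ids"]).last_hidden_state

real = enc["attention_mask"].bool()  # positions of real (non-pad) tokens
print(torch.allclose(with_mask[real], without_mask[real], atol=1e-4))  # expected: False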
I took a closer look at the source code in language_modeling_model.py and noticed that the attention masks are only created when use_hf_datasets is enabled:
...
inputs = inputs.to(self.device)
attention_mask = (
    batch["attention_mask"].to(self.device)
    if self.args.use_hf_datasets
    else None
)
token_type_ids = (
    batch["token_type_ids"].to(self.device)
    if self.args.use_hf_datasets and "token_type_ids" in batch
    else None
)
...
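If I am reading this correctly, one possible workaround would be to derive the mask from the padded input_ids themselves instead of falling back to None. This is only a rough sketch of what I have in mind, not the library's actual code, and it assumes the pad token id is reachable through the model's tokenizer at this point:

# Sketch of a possible fallback: mask out everything that equals the pad token.
# (Assumes self.tokenizer is available here; inputs is already on self.device.)
pad_token_id = self.tokenizer.pad_token_id
attention_mask = (
    batch["attention_mask"].to(self.device)
    if self.args.use_hf_datasets
    else (inputs != pad_token_id).long()
)

If the sliding window pads with the tokenizer's pad token (which I assume it does), masking everything equal to pad_token_id should reconstruct the mask that would otherwise be provided.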
I assume that without sliding_window no padding is added, so this warning does not occur.
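For context, here is a rough illustration of how I understand the sliding window to introduce padding; this is just my own sketch, not simpletransformers' actual implementation, and the stride value of 0.8 is only an assumption:

# Long token sequences are split into overlapping windows; the last window is
# shorter than max_seq_length and gets filled with pad tokens (illustration only).
def make_windows(token_ids, max_seq_length=512, stride=0.8, pad_token_id=0):
    step = int(max_seq_length * stride)
    windows = []
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + max_seq_length]
        window = window + [pad_token_id] * (max_seq_length - len(window))
        windows.append(window)
        if start + max_seq_length >= len(token_ids):
            break
    return windows

windows = make_windows(list(range(1, 1201)))
print(len(windows), [w.count(0) for w in windows])  # 3 windows, only the last one padded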
Did I understand that correctly? I also evaluated the model on a test set whose examples are no longer than max_seq_length, both with and without this parameter, and found drastic differences in eval loss and perplexity.
So my question is: is there a way to somehow include the attention_mask, which is generally important for LM training, or does it have no influence on the quality of model fine-tuning in this particular situation?
Thank you in advance!
Darija