huggingface/notebooks

tokenizer warning for Multiple choice

jaideep11061982 opened this issue · 0 comments

https://github.com/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb
I think calling tokenizer.pad inside the collator is a slow operation, which is why there is a warning suggesting that we pad when calling tokenizer( ) instead, for example by passing padding=True there.
Doing the padding inside the collator slows down training. Is there any way to use the tokenizer's padding option directly?

accepted_keys = ["input_ids", "attention_mask", "label"]
features = [{k: v for k, v in encoded_datasets["train"][i].items() if k in accepted_keys} for i in range(10)]
batch = DataCollatorForMultipleChoice(tokenizer)(features)
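
To make the question concrete, here is a minimal sketch of a collator that pads by hand and never calls tokenizer.pad, so the warning would not be triggered. It is not the notebook's implementation: the function name `pad_multiple_choice_batch`, the toy features, and the default `pad_token_id=0` are assumptions for illustration (in practice the value would presumably come from `tokenizer.pad_token_id`). It assumes each feature stores one token list per answer choice, as in the multiple choice notebook.

```python
from typing import Any, Dict, List


def pad_multiple_choice_batch(
    features: List[Dict[str, Any]], pad_token_id: int = 0
) -> Dict[str, Any]:
    """Pad every choice of every example to the longest sequence in the batch.

    Hypothetical helper: each feature is assumed to hold
    "input_ids" and "attention_mask" as lists of per-choice token lists,
    plus an integer "label".
    """
    labels = [f["label"] for f in features]
    # Longest tokenized choice across the whole batch.
    max_len = max(len(choice) for f in features for choice in f["input_ids"])

    batch: Dict[str, Any] = {"input_ids": [], "attention_mask": [], "labels": labels}
    for f in features:
        ids_rows, mask_rows = [], []
        for ids, mask in zip(f["input_ids"], f["attention_mask"]):
            gap = max_len - len(ids)
            ids_rows.append(list(ids) + [pad_token_id] * gap)  # pad token ids
            mask_rows.append(list(mask) + [0] * gap)           # mask out padding
        batch["input_ids"].append(ids_rows)
        batch["attention_mask"].append(mask_rows)
    return batch


# Toy batch: 2 examples, 2 choices each, with different lengths.
toy_features = [
    {
        "input_ids": [[101, 7, 102], [101, 7, 8, 9, 102]],
        "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1]],
        "label": 1,
    },
    {
        "input_ids": [[101, 5, 6, 102], [101, 102]],
        "attention_mask": [[1, 1, 1, 1], [1, 1]],
        "label": 0,
    },
]
batch = pad_multiple_choice_batch(toy_features)
```

The nested lists would still need converting to tensors (for example with `torch.tensor`) before being passed to a model. Another option, if fixed-length inputs are acceptable, is to pass `padding="max_length"` together with `max_length` and `truncation=True` in the tokenizer call during preprocessing, at the cost of padding every example to the same length rather than to the longest one in each batch.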