Why set `tokenizer.pad_token = tokenizer.eos_token`
Hi there, thanks for your article and for providing repos/LLM-Alchemy-Chamber/Finetuning/Gemma_finetuning_notebook.ipynb as a reference!
I was wondering if you could explain why you set `tokenizer.pad_token = tokenizer.eos_token` before starting training?
I found some related info here: https://stackoverflow.com/a/76453052
Hey, that's a really good question.
So, we set the pad token to the end-of-sequence (EOS) token because the length of the examples varies across the dataset. Say we're fine-tuning with a context window of 8k tokens and a tokenized example comes out to 7k tokens; the remaining 1k positions in the sequence have to be padded, and with this setting they're filled with the EOS token.
This way, during inference the model is more likely to generate the EOS token when the reply should end, instead of repeating itself or generating something irrelevant.
Some models ship with a dedicated PAD token that was used during pretraining, but when fine-tuning I think it's good practice to set the pad token to EOS to avoid repetition.
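For illustration, here's a minimal sketch of what that assignment does at the tokenizer level (the model id and example strings are placeholders, not taken from the notebook):

```python
from transformers import AutoTokenizer

# Placeholder model id; the gated Gemma checkpoints require HF authentication.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

# Reuse the EOS token as the padding token, as in the notebook.
tokenizer.pad_token = tokenizer.eos_token

# Shorter sequences in a batch are now filled with the EOS id instead of a
# separate pad symbol.
batch = tokenizer(
    ["Short example.", "A somewhat longer example sentence for comparison."],
    padding=True,
    return_tensors="pt",
)
print(tokenizer.pad_token_id == tokenizer.eos_token_id)  # True
print(batch["input_ids"])  # the shorter row is padded out with the EOS id
```

Whether those padded positions also contribute to the loss depends on the data collator you use during training, so that's worth checking separately for your setup.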
@adithya-s-k thank you for the explanation, that helps a lot!