adithya-s-k/AI-Engineering.academy

Why set `tokenizer.pad_token = tokenizer.eos_token`

Closed this issue · 3 comments

Hi there, thanks for your article and for providing repos/LLM-Alchemy-Chamber/Finetuning/Gemma_finetuning_notebook.ipynb as a reference!

I was wondering if you could explain why you set `tokenizer.pad_token = tokenizer.eos_token` before starting training?

I found some info on this here: https://stackoverflow.com/a/76453052

Hey, that's a really good question.

So, we set the pad token to the end-of-sequence (EOS) token because the sequence lengths vary across the dataset. When we're fine-tuning, say the context window is 8k tokens and a sample tokenizes to only 7k tokens; the remaining positions in the sequence are then padded with the EOS token.

This helps ensure that during inference the model generates the EOS token where the reply should end, instead of repeating itself or generating something irrelevant.

Some models ship with a dedicated PAD token that was used during pretraining, but when fine-tuning I think it's good practice to set the pad token to the EOS token to avoid repetition.
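
For reference, here is a minimal sketch of what this looks like with the Hugging Face transformers tokenizer API (the `google/gemma-2b` checkpoint name and the example sentences are just illustrative, not necessarily what the notebook uses):

```python
# Minimal sketch, not the notebook's exact code: reuse the EOS token as the
# padding token before tokenizing variable-length samples for fine-tuning.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")  # checkpoint name is illustrative

# Reuse the EOS token as the pad token.
tokenizer.pad_token = tokenizer.eos_token

# Shorter samples are now padded with EOS token ids up to max_length;
# the attention mask still marks those padded positions with 0.
batch = tokenizer(
    ["a short sample", "a somewhat longer training sample"],
    padding="max_length",
    max_length=16,
    return_tensors="pt",
)
print(batch["input_ids"])       # padding positions hold tokenizer.eos_token_id
print(batch["attention_mask"])  # padding positions are 0
```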

@adithya-s-k thank you for the explanation, that helps a lot!