epfLLM/Megatron-LLM

Prepend BOS token

panx27 opened this issue · 1 comment

panx27 commented

In the original Llama repository, a BOS token is prepended during inference, as seen in this code snippet.
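For a quick way to see the same behaviour, the Hugging Face port of the Llama tokenizer also prepends BOS by default (a minimal sketch; the checkpoint name is just an example):

```python
from transformers import AutoTokenizer

# Example checkpoint; any Llama-family tokenizer shows the same default behaviour.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

ids = tok("Hello world").input_ids
# The first id is the BOS token, mirroring the inference-time encoding in the Llama repo.
assert ids[0] == tok.bos_token_id
```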

Given this, should we also prepend a BOS token for each document during the 2nd stage of pretraining to ensure alignment with the original model's practices?

In prior models such as GPT-2 and BLOOM, an <|endoftext|> token is typically used to delimit documents, e.g. doc1 <eos> doc2 <eos> .... I'm not certain how Llama-2 handles this exactly, but perhaps something like <bos> doc1 <eos> <bos> doc2 <eos> ...?
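To make the proposal concrete, here is a minimal sketch of how such per-document packing could look, assuming a Hugging Face Llama tokenizer (the checkpoint name and helper function are illustrative, not how Megatron-LLM's preprocessing actually works):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any Llama-family tokenizer exposes bos/eos ids the same way.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def pack_documents(docs):
    """Concatenate documents as <bos> doc1 <eos> <bos> doc2 <eos> ..."""
    ids = []
    for doc in docs:
        ids.append(tok.bos_token_id)
        # Encode the raw text only; special tokens are added explicitly above/below.
        ids.extend(tok.encode(doc, add_special_tokens=False))
        ids.append(tok.eos_token_id)
    return ids

packed = pack_documents(["first document", "second document"])
```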

panx27 commented

Based on this recent work, always adding a sink token such as BOS at the beginning of each sequence might be helpful. I will close this issue.