Prepend bos token
panx27 opened this issue · 1 comment
panx27 commented
In the original Llama repository, a BOS token is prepended during inference, as seen in this code snippet.
Given this, should we also prepend a BOS token to each document during the second stage of pretraining, to stay aligned with the original model's practice?
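For reference, one quick way to check this behavior, assuming the Hugging Face Llama-2 tokenizer mirrors the original repository's inference-time handling (model name here is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

ids = tokenizer.encode("hello world")      # default: add_special_tokens=True
print(ids[0] == tokenizer.bos_token_id)    # True  -> BOS is prepended
print(ids[-1] == tokenizer.eos_token_id)   # False -> EOS is not appended
```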
In prior models such as GPT-2 and BLOOM, an <|endoftext|> token is typically used to delineate separate documents; a common approach is doc1 <eos> doc2 <eos> .... I'm uncertain about Llama-2's exact handling of this, but maybe something like <bos> doc1 <eos> <bos> doc2 <eos> ...?
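A minimal packing sketch of that <bos> doc <eos> layout (the model name, `pack_documents` helper, and block size are assumptions for illustration, not taken from the repo):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
BLOCK_SIZE = 4096  # assumed pretraining context length

def pack_documents(docs):
    """Concatenate <bos> doc <eos> per document, then split into fixed-length blocks."""
    stream = []
    for doc in docs:
        ids = tokenizer.encode(doc, add_special_tokens=False)
        stream.extend([tokenizer.bos_token_id] + ids + [tokenizer.eos_token_id])
    # drop the trailing partial block for simplicity
    return [stream[i:i + BLOCK_SIZE]
            for i in range(0, len(stream) - BLOCK_SIZE + 1, BLOCK_SIZE)]

blocks = pack_documents(["first document ...", "second document ..."])
```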
panx27 commented
Based on this recent work, always adding a sink token such as BOS at the beginning of each sequence might be helpful. I will close this issue.