how to deal with special tokens for multiple files
IamExperimenting opened this issue · 0 comments
Hi,
I have a question about Byte-Pair Encoding, specifically about special tokens. My domain dataset consists of 1780 files, and I am unsure about the following:
- Do I need to add <|startoftext|> at the beginning and <|endoftext|> at the end of the text in each file?
- Or do I need to combine all 1780 files into one and add <|endoftext|> at the end of each file's text? As Andrej mentioned, this lets the model treat it as a delimiter.
- Is minbpe capable of handling this on its own?
- Is there a specific format I should prepare my data in before passing it to minbpe, such as a dataframe with one text file per row?
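To make the second option concrete, here is a minimal sketch of what I have in mind, using only the Python standard library. The helper name `combine_files` and the two sample files are hypothetical, made up for illustration, and I am assuming the tokenizer would then be trained on one long string. Whether minbpe treats the delimiter specially during training is exactly the part I am unsure about.

```python
import tempfile
from pathlib import Path

# Delimiter string from the question; treated here as plain text.
END_OF_TEXT = "<|endoftext|>"


def combine_files(paths, delimiter=END_OF_TEXT):
    """Read each file in order and append the delimiter after its text."""
    parts = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8").strip()
        parts.append(text + delimiter)
    return "".join(parts)


# Small demo with two throwaway files standing in for the 1780 real ones.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first document", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second document", encoding="utf-8")
    corpus = combine_files(sorted(Path(tmp).glob("*.txt")))
    print(corpus)
    # prints: first document<|endoftext|>second document<|endoftext|>
```

Sorting the paths keeps the document order reproducible between runs. The resulting `corpus` string is what I would pass on for training, if a single string is the expected input.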
Can you please help me understand this, @karpathy?