karpathy/minbpe

How to deal with special tokens for multiple files

IamExperimenting opened this issue · 0 comments

Hi,

I have a question about Byte-Pair Encoding, specifically about special tokens. My domain dataset consists of 1780 files. Do I need to:

  1. add <|startoftext|> at the beginning and <|endoftext|> at the end of the text in each file?
  2. or combine all 1780 files into one, adding <|endoftext|> at the end of each file's text? As Andrej mentioned, this lets the model treat it as a delimiter.
  3. Can minbpe handle this on its own?
  4. Is there a specific format I should prepare my data in before passing it to minbpe, e.g. a dataframe with one text file per row?
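
To make option 2 concrete, here is a minimal sketch of what I have in mind, using only the Python standard library. The function names and the directory name `my_domain_data` are hypothetical, and I am assuming the files are UTF-8 `.txt` files. Whether the delimiter should appear in the text passed to minbpe at all is part of what I am asking.

```python
from pathlib import Path

# Delimiter string from the question; appended after every document.
END_OF_TEXT = "<|endoftext|>"


def read_documents(data_dir, pattern="*.txt"):
    """Read every matching file in data_dir (sorted by path) as UTF-8 text."""
    return [p.read_text(encoding="utf-8") for p in sorted(Path(data_dir).glob(pattern))]


def join_documents(docs, delimiter=END_OF_TEXT):
    """Concatenate the documents, appending the delimiter after each one."""
    return "".join(doc + delimiter for doc in docs)


# Hypothetical usage: one string holding all 1780 files.
# corpus = join_documents(read_documents("my_domain_data"))
```

This would give me a single string with <|endoftext|> marking each file boundary, as opposed to keeping one document per row.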

Could you please help me understand this, @karpathy?