shyamsn97/mario-gpt

Fine-tuned tokenizer

ChinJianGin opened this issue · 2 comments

Hello,
Your project is awesome, and I'm delighted to have found such a fantastic project.
When I cloned your project and trained it myself, I tried to save the tokenizer the same way I save the model, but the resulting tokenizer_config.json and tokenizer.json don't match yours. I don't know how to resize the vocab from 50256 to 256 and set the "endoftext" token to id = 0. Could you give me some tips on how to fine-tune the tokenizer? I ask because when I run my trained model I have to use your tokenizer; if I use the tokenizer I saved, decoding goes wrong.
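For context, this is roughly how I save and reload everything; the base checkpoint and paths here are just examples, not the project's actual setup:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example setup only -- in practice these come from the mario-gpt training code.
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# ... training happens here ...

# Saving the tokenizer the same way as the model writes tokenizer.json
# and tokenizer_config.json next to the weights.
model.save_pretrained("checkpoints/my-mario-gpt")
tokenizer.save_pretrained("checkpoints/my-mario-gpt")

# Reloading both later for generation.
model = AutoModelForCausalLM.from_pretrained("checkpoints/my-mario-gpt")
tokenizer = AutoTokenizer.from_pretrained("checkpoints/my-mario-gpt")
```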

This is my tokenizer_config.json file:
[screenshot of tokenizer_config.json]

This is my tokenizer.json file:
[screenshots of tokenizer.json]

Hey! You can see a full training example in the train notebook, but here's where I get the tokenizer ready: https://github.com/shyamsn97/mario-gpt/blob/main/mario_gpt/dataset.py#L68
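In case it helps anyone else, the general idea is to retrain the base GPT-2 tokenizer on the level text with a small vocabulary cap. Here's a minimal sketch of that (not the exact code from dataset.py, and the level rows are made up):

```python
from transformers import AutoTokenizer

# Start from the full GPT-2 tokenizer (~50k tokens).
base = AutoTokenizer.from_pretrained("distilgpt2")

# Hypothetical level rows standing in for the real mario-gpt dataset strings.
corpus = [
    "--------------",
    "----?----Q----",
    "XXXXXXXXXXXXXX",
]

# train_new_from_iterator retrains the tokenizer's underlying model from
# scratch on the corpus while keeping the original pipeline and special
# tokens. Special tokens are added first, so <|endoftext|> should land at
# id 0, and vocab_size caps the vocabulary at roughly 256 entries.
small = base.train_new_from_iterator(corpus, vocab_size=256)

print(small.convert_tokens_to_ids("<|endoftext|>"))  # expected: 0
print(len(small))  # a few hundred tokens instead of ~50k

# Save it next to the model so decoding stays consistent on reload.
small.save_pretrained("fine_tuned_tokenizer")
```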

Oh, I got it! Thank you very much for the help. Have a nice day!