Extend tokens.txt with new tokens on pretrained model

Question

gorosei-dev opened this issue 7 months ago · 1 comment

Suppose I want to continue training the pretrained model on more data, but the new data contains tokens that are not covered by the existing tokens.txt / bpe.model. I want the new model to recognize these new tokens. How can I achieve this without retraining from scratch?

Answer 1 · 2024-03-20T02:52:57.000Z

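You can reuse all parameters of your pretrained model except those of the output layer, whose size depends on the number of tokens. Also remember to change the lang_dir you use for the fine-tuning run so that it points to the language directory containing the extended tokens.txt / bpe.model.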
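
As a rough illustration of "reuse everything except the output layer", the sketch below keeps only the pretrained parameters whose name and shape still match the new model. It is not taken from the project's code: the helper name select_reusable_params, the parameter names, and the sizes (500 old tokens, 520 new tokens) are all hypothetical, and plain stand-in objects with a .shape attribute are used instead of real tensors so the snippet runs without any dependencies.

```python
from types import SimpleNamespace


def select_reusable_params(pretrained_state, new_state):
    """Split pretrained entries into reusable ones and skipped ones.

    An entry is reusable when the new model has a parameter with the
    same name and the same shape. Layers sized by the vocabulary
    (such as the output layer) fail the shape check and are skipped.
    """
    reusable, skipped = {}, []
    for name, value in pretrained_state.items():
        target = new_state.get(name)
        if target is not None and tuple(target.shape) == tuple(value.shape):
            reusable[name] = value
        else:
            skipped.append(name)
    return reusable, skipped


def fake_tensor(*shape):
    # Stand-in for a tensor: only the .shape attribute matters here.
    return SimpleNamespace(shape=shape)


# Hypothetical parameter names and sizes, for illustration only.
old_state = {
    "encoder.weight": fake_tensor(256, 80),
    "output.weight": fake_tensor(500, 256),  # 500 tokens in the old tokens.txt
    "output.bias": fake_tensor(500),
}
new_state = {
    "encoder.weight": fake_tensor(256, 80),
    "output.weight": fake_tensor(520, 256),  # 520 tokens after extension
    "output.bias": fake_tensor(520),
}

reusable, skipped = select_reusable_params(old_state, new_state)
print(sorted(reusable))  # ['encoder.weight']
print(sorted(skipped))   # ['output.bias', 'output.weight']
```

With PyTorch, the same helper could be given the checkpoint's state dict and model.state_dict(), and the result loaded with model.load_state_dict(reusable, strict=False), so that the skipped layers keep their fresh initialization. How the state dict is stored inside the checkpoint file depends on the training script, so check that before loading.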