jerryji1993/DNABERT

Can you use the pre-trained BERT models, but add novel tokens to the vocabulary?

mepster opened this issue · 0 comments

Can you use the pre-trained BERT models, but add novel tokens to the vocabulary during fine-tuning? Any tips on what's needed for this?

Or, during fine-tuning, MUST you use the same vocab.txt file that was used in pre-training?

I want to add some of the IUPAC symbols, for example the symbol Y, which means "T or C". That will expand my vocabulary a lot (with 6-mers, going from a 4-letter to a 5-letter alphabet grows the full k-mer space from 4^6 = 4096 to 5^6 = 15,625 tokens).

But I don't have the resources to retrain from scratch.
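Here is a minimal sketch of what I have in mind, assuming DNABERT's tokenizer and model follow the standard HuggingFace transformers API (the checkpoint path and the new tokens below are just illustrative placeholders):

```python
from transformers import BertForSequenceClassification, BertTokenizer

# Hypothetical path; substitute the actual DNABERT checkpoint directory.
model_path = "dnabert-6"

tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)

# Add novel tokens (e.g., k-mers containing the IUPAC symbol Y) to the vocab.
new_tokens = ["ATGCAY", "YTTAGC"]  # illustrative examples only
num_added = tokenizer.add_tokens(new_tokens)

# Resize the embedding matrix so the new tokens get fresh (randomly
# initialized) embedding rows; these would then be learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))

print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
```

If this is viable, only the new embedding rows would start untrained while all the pre-trained weights are reused, which is why I'm hoping it avoids retraining from scratch. Is something like this the right approach here?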

Related, but I believe it discusses training from scratch:
#81