Reduce vocab size for BPE tokenizer
fzyzcjy opened this issue · 8 comments
Hi, thanks for the library! I am using e.g. Llama 3.1's tokenizer, but its 128k vocab size is too large for my field. To make training faster, I would like to reduce the tokenizer vocab size by removing the tokens that I will never use (e.g. words outside of my field). However, it seems tokenizers does not provide a convenient method for this.
Hey! I'll add the feature request as indeed we don't provide this out of the box.
You need to also re-map the ids of the model embeddings so it's a bit more involved.
If you directly modify the `tokenizer.json`, this can be achieved easily tho!
@ArthurZucker Thank you! Could you please provide a bit more detail? I was thinking about modifying `tokenizer.json`, but I am worried about the following:
For example, suppose I am only interested in the token `hello`, and suppose the merges are `h e`, `l l`, `ll o`, `he llo` (or something else). If I throw away tokens like `h`, `e`, `ll`, ..., or throw away the merges, then I am worried I will never get the `hello` token.
My naive thought is to keep all "parent" tokens (`h`, `e`, `ll`, ...) and their merges, but that makes the vocab quite large. Is there a better way?
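To make the "keep all parents" idea concrete, here is a minimal sketch in plain Python. It assumes the merges are available as an ordered list of `(left, right)` pairs and that each merged token is the concatenation of its two parts; the helper name `required_for` is hypothetical and not part of tokenizers. A real vocab may produce the same token through more than one merge, and this sketch only follows the first one it finds.

```python
def required_for(targets, merges):
    """Return the tokens and merges needed to build each target token.

    `merges` is an ordered list of (left, right) pairs; a pair is assumed
    to produce the token left + right.
    """
    # Map each merged token to the first pair that produces it.
    produced_by = {}
    for left, right in merges:
        produced_by.setdefault(left + right, (left, right))

    needed_tokens, needed_merges = set(), set()
    stack = list(targets)
    while stack:
        token = stack.pop()
        if token in needed_tokens:
            continue
        needed_tokens.add(token)
        pair = produced_by.get(token)
        if pair is not None:
            # Not a base symbol: keep its merge and visit both parents.
            needed_merges.add(pair)
            stack.extend(pair)

    # Preserve the original merge priority order.
    kept_merges = [m for m in merges if m in needed_merges]
    return needed_tokens, kept_merges


merges = [("h", "e"), ("l", "l"), ("ll", "o"), ("he", "llo"), ("w", "o")]
tokens, kept = required_for(["hello"], merges)
print(sorted(tokens))  # ['e', 'h', 'he', 'hello', 'l', 'll', 'llo', 'o']
print(kept)            # [('h', 'e'), ('l', 'l'), ('ll', 'o'), ('he', 'llo')]
```

For a single target the closure is small, but over thousands of target tokens it can grow quickly, which is the size concern raised above.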
Well, you can switch to a non-BPE tokenizer, for example.
A way to achieve that is to use `added_tokens`. You can remove `h`, `e`, `ll`, etc. and add `hello` as an added token.
@ArthurZucker Thank you! However, I am afraid the tokenization of the same sentence may then be quite different. I am using a pretrained Llama (or something like that) and doing some SFT, so I hope not to make the tokenized results so wildly different that the model gets confused.
Yeah completely get it.
It's kind of an open problem: how to effectively compress a tokenizer!
The main issue is that:
- Not all the merges are part of the vocab
- All tokens should be accessible with merges
- You don't necessarily need all merges from the vocab for your language
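The second point can be checked mechanically before committing to a pruned vocab. The sketch below is plain Python under stated assumptions: `vocab` is a set of token strings, `merges` is the ordered list of `(left, right)` pairs, and the parts of every merge appear earlier in the list (as is typical for a trained BPE merge table). The function name `unreachable_tokens` is hypothetical.

```python
def unreachable_tokens(vocab, merges, base_symbols):
    """Return the tokens in `vocab` that no chain of merges can produce."""
    vocab = set(vocab)
    reachable = set(base_symbols) & vocab
    for left, right in merges:  # walk merges in priority order
        merged = left + right
        # A merge only helps if both parts are reachable and the result was kept.
        if left in reachable and right in reachable and merged in vocab:
            reachable.add(merged)
    return vocab - reachable


merges = [("h", "e"), ("l", "l"), ("ll", "o"), ("he", "llo")]
base = {"h", "e", "l", "o"}

full_vocab = base | {"he", "ll", "llo", "hello"}
print(unreachable_tokens(full_vocab, merges, base))    # set()

# Dropping the intermediate token `llo` strands `hello`.
pruned_vocab = base | {"he", "ll", "hello"}
print(unreachable_tokens(pruned_vocab, merges, base))  # {'hello'}
```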
Here is what I would do:
- train a new tokenizer on your language, set the limit you want
- from the newly created vocab and merges, you know which tokens are needed for your language.
- remap these tokens / embeddings.
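The remapping step can be sketched without any library. This assumes the old vocab is a `token -> id` dict, that the embedding table is indexed by token id (one row per token, shown here as a list of lists), and that tokens absent from the original vocab are simply skipped because they have no pretrained row. The names `build_remap` and `select_rows` are hypothetical.

```python
def build_remap(old_vocab, keep_tokens):
    """Assign contiguous new ids to the kept tokens, in old-id order."""
    kept = sorted(
        (tok for tok in keep_tokens if tok in old_vocab),
        key=old_vocab.__getitem__,
    )
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept)}
    old_ids = [old_vocab[tok] for tok in kept]  # old_ids[new_id] == old id
    return new_vocab, old_ids


def select_rows(matrix, old_ids):
    """Pick the rows of an embedding-like table that survive the pruning."""
    return [matrix[i] for i in old_ids]


old_vocab = {"h": 0, "e": 1, "w": 2, "l": 3, "o": 4,
             "he": 5, "ll": 6, "llo": 7, "hello": 8}
# Toy embedding table: row i belongs to old token id i.
embeddings = [[float(i), float(i) + 0.5] for i in range(len(old_vocab))]

# Tokens the newly trained tokenizer says are needed ("xyz" is not in the old vocab).
wanted = {"h", "e", "l", "o", "he", "ll", "llo", "hello", "xyz"}

new_vocab, old_ids = build_remap(old_vocab, wanted)
new_embeddings = select_rows(embeddings, old_ids)

print(old_ids)                               # [0, 1, 3, 4, 5, 6, 7, 8]
print(new_vocab["hello"])                    # 7
print(new_embeddings[new_vocab["hello"]])    # [8.0, 8.5]
```

The same `old_ids` list would be applied to both the input embedding and the `lm_head`, so that new id `i` points at the pretrained weights of the corresponding old token in both places.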
Thanks!
> remap these tokens / embeddings
I would appreciate it if I could know a bit more. I am currently thinking about reducing the tokenizer, e.g. picking 10000 tokens from the original 128000-token vocab. Then we can pick the corresponding columns in the embedding/lm_head. It seems you are doing something even more complex: choosing some tokens that may not even appear in the original vocab. In that case, I wonder how we should utilize the original embedding/lm_head.
What I am suggesting is the same as what you describe, but with a way of selecting those 10000 tokens: training a new tokenizer on a relevant corpus, which should yield roughly the vocab you need, or should at least be a good start.
Thank you!