huggingface/tokenizers

Reduce vocab size for BPE tokenizer

fzyzcjy opened this issue · 8 comments

Hi, thanks for the library! I am using, e.g., Llama 3.1's tokenizer, but its 128k vocab size is too large for my field. To make training faster, I would like to reduce the tokenizer's vocab size by removing tokens I will never use (e.g. words outside of my field). However, it seems tokenizers does not provide a convenient method for this.

Hey! I'll add the feature request, as indeed we don't provide this out of the box.
Note that you also need to re-map the ids of the model embeddings, so it's a bit more involved.

If you directly modify the tokenizer.json, this can be achieved fairly easily, though!
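For reference, a minimal sketch of what that looks like (the key layout matches the current serialization format; `kept_tokens` and the trimming rule are placeholders you would replace with your own selection logic):

```python
import json

# Sketch only: tokenizer.json stores the BPE model under the "model" key.
with open("tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

vocab = data["model"]["vocab"]    # {token_string: id}
merges = data["model"]["merges"]  # merge rules, in priority order

def parts(m):
    # merges are serialized either as "a b" strings (older format) or ["a", "b"] lists (newer)
    return m.split(" ") if isinstance(m, str) else m

# Placeholder selection: replace with the set of tokens you actually want to keep.
kept_tokens = set(vocab)

# Re-number the kept tokens densely (the model embeddings must be re-mapped to match!),
# and keep only merges whose inputs and output all survive.
# NOTE: the "added_tokens" section also stores ids and would need the same re-numbering.
data["model"]["vocab"] = {t: i for i, t in enumerate(sorted(kept_tokens, key=vocab.get))}
data["model"]["merges"] = [
    m for m in merges
    if all(p in kept_tokens for p in parts(m)) and "".join(parts(m)) in kept_tokens
]

with open("tokenizer_small.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)
```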

@ArthurZucker Thank you! Could you please provide a bit more detail? I was thinking about modifying tokenizer.json, but got worried about the following:

For example, suppose I am only interested in the token hello, and suppose the merges are h e, l l, ll o, he llo (or something similar). If I throw away tokens like h, e, ll, ..., or throw away the merges, I am worried I will never get the hello token.

My naive thought is to keep all the "parent" tokens (h, e, ll, ...) and their merges, but that keeps the vocab quite large. Is there a better way?
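To make it concrete, something like this (untested) sketch is what I mean by keeping the parents; walking the merges backwards pulls in every ancestor:

```python
# Untested sketch of the "keep all parent tokens" idea: for every token I keep,
# also keep the two tokens that merge into it, recursively.
def collect_parents(kept, merges):
    built_from = {left + right: (left, right) for left, right in merges}
    needed = set()
    stack = list(kept)
    while stack:
        tok = stack.pop()
        if tok in needed:
            continue
        needed.add(tok)
        if tok in built_from:
            stack.extend(built_from[tok])  # keep both parents (and, transitively, theirs)
    return needed

toy_merges = [("h", "e"), ("l", "l"), ("ll", "o"), ("he", "llo")]
print(collect_parents({"hello"}, toy_merges))
# -> {'hello', 'he', 'llo', 'h', 'e', 'll', 'l', 'o'}  (8 tokens just to keep one word)
```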

Well, you could switch to a non-BPE tokenizer, for example.
Another way is to use added_tokens: you can remove h, e, ll, etc. and add hello as an added token (sketch below).
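Something along these lines (untested; `single_word` whole-word matching is just one option):

```python
from tokenizers import Tokenizer, AddedToken

tok = Tokenizer.from_file("tokenizer.json")
# "hello" is now matched as a whole token before any BPE merges are applied
tok.add_tokens([AddedToken("hello", single_word=True)])
print(tok.encode("hello world").tokens)
```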

@ArthurZucker Thank you! However, I am afraid the tokenization of the same sentence may then be quite different. I am using a pretrained Llama (or something like that) and doing some SFT, so I would rather not make the tokenized results so wildly different that the model gets confused.

Yeah, completely get it.
It's kind of an open problem: how to effectively compress a tokenizer!
The main issues are that:

  1. Not all the merges are part of the vocab
  2. All tokens should be reachable through merges
  3. You don't necessarily need all of the merges / vocab for your language

Here is what I would do (rough sketch after the list):

  1. Train a new tokenizer on your language, with the vocab size limit you want.
  2. From the newly created vocab and merges, you know which tokens are needed for your language.
  3. Remap these tokens / embeddings.
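Roughly, for steps 1 and 2 (untested; the corpus path, checkpoint name and vocab size are placeholders, and this assumes the checkpoint ships a fast tokenizer):

```python
from transformers import AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

def corpus_iterator():
    # yield your in-domain text, e.g. line by line
    with open("my_domain_corpus.txt", encoding="utf-8") as f:
        for line in f:
            yield line

# Step 1: retrain the same tokenization pipeline on your corpus with a smaller limit
new_tok = old_tok.train_new_from_iterator(corpus_iterator(), vocab_size=10_000)
new_tok.save_pretrained("llama-small-tokenizer")

# Step 2: the overlap with the original vocab tells you which original tokens you need
old_vocab = old_tok.get_vocab()
kept = [t for t in new_tok.get_vocab() if t in old_vocab]
print(f"{len(kept)} / {len(new_tok.get_vocab())} new tokens already exist in the original vocab")
```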

Thanks!

remap these tokens / embeddings

I would appreciate a bit more detail. I am currently thinking about shrinking the tokenizer, e.g. picking 10,000 tokens out of the original 128,000-token vocab; then we can pick the corresponding rows of the embedding/lm_head. It seems you are suggesting something even more involved: choosing tokens that may not even appear in the original vocab. In that case, I wonder how we should make use of the original embedding/lm_head.

What I am suggesting is the same as what you describe, but with a way of selecting those 10,000 tokens (training a new tokenizer on a relevant corpus), which should yield roughly the vocab you need, or at least be a good start.
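Untested, but the remapping step could look roughly like this (checkpoint name and tokenizer path are placeholders; the mean-of-sub-pieces warm start for unseen tokens is just one reasonable choice):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
new_tok = AutoTokenizer.from_pretrained("llama-small-tokenizer")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

old_emb = model.get_input_embeddings().weight.data    # [old_vocab_size, hidden]
old_head = model.get_output_embeddings().weight.data  # [old_vocab_size, hidden]
old_vocab = old_tok.get_vocab()

new_vocab = new_tok.get_vocab()
new_emb = torch.empty(len(new_vocab), old_emb.shape[1], dtype=old_emb.dtype)
new_head = torch.empty(len(new_vocab), old_head.shape[1], dtype=old_head.dtype)

for token, new_id in new_vocab.items():
    if token in old_vocab:
        # token kept from the original vocab: copy its embedding / lm_head row as-is
        new_emb[new_id] = old_emb[old_vocab[token]]
        new_head[new_id] = old_head[old_vocab[token]]
    else:
        # genuinely new token: warm-start from the mean of its old sub-piece embeddings
        text = new_tok.convert_tokens_to_string([token])
        ids = old_tok.encode(text, add_special_tokens=False)
        if not ids:  # e.g. a special token the old tokenizer strips
            ids = [0]
        new_emb[new_id] = old_emb[ids].mean(dim=0)
        new_head[new_id] = old_head[ids].mean(dim=0)

model.resize_token_embeddings(len(new_vocab))
model.get_input_embeddings().weight.data.copy_(new_emb)
model.get_output_embeddings().weight.data.copy_(new_head)
model.save_pretrained("llama-small")
new_tok.save_pretrained("llama-small")
```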

Thank you!