Adding many AddedTokens makes loading a tokenizer extremely slow.
stephantul opened this issue · 4 comments
Hi!
I'm not sure if this is a problem that can be solved, or needs to be solved. Basically, we want to make a kind of hybrid tokenizer, in which we add a whole bunch of whole words to a tokenizer, and select these words instead of the subwords if they appear.
For example: if we pass the pretokenized string ["dog", "walks", "around", "Paris"], and "Paris" is a whole token, we want to select it instead of decomposing it into subtokens. I think that adding Paris as an AddedToken is the right approach for this (but please correct me if I'm wrong).
So, we added many of these tokens (about 400k), but this makes loading a tokenizer extremely slow, like, it takes 15-30 minutes to load. We now add them as regular tokens, which works fine, but which has the downside of also finding these whole word tokens as part of other words. For example, Parisians will now be turned into ["Paris", "##ians"], which might have a different meaning.
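To make the difference concrete, here is a minimal pure-Python sketch of the two behaviours. It is a simplified greedy longest-match WordPiece, not the tokenizers implementation, and the toy vocabulary and the names `wordpiece` and `hybrid` are made up for illustration.

```python
def wordpiece(word, vocab, prefix="##", unk="[UNK]"):
    """Greedy longest-match-first split of a single pre-tokenized word."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def hybrid(words, whole_words, vocab):
    """Use a whole-word token on an exact match, otherwise fall back to subwords."""
    out = []
    for w in words:
        if w in whole_words:
            out.append(w)
        else:
            out.extend(wordpiece(w, vocab))
    return out


# Hypothetical toy vocabulary, for illustration only.
subword_vocab = {"dog", "walks", "around", "Par", "##is", "##ians", "##isians"}

# "Paris" added as a regular token: it is also found inside "Parisians".
regular_vocab = subword_vocab | {"Paris"}
print(wordpiece("Parisians", regular_vocab))  # ['Paris', '##ians']

# "Paris" kept as a whole-word token: only exact matches select it.
print(hybrid(["dog", "walks", "around", "Paris"], {"Paris"}, subword_vocab))
print(hybrid(["Parisians"], {"Paris"}, subword_vocab))  # ['Par', '##isians']
```

The second behaviour is the one we are after: the whole-word list is consulted only for an exact match on the pretokenized word, so longer words are left to the original subword vocabulary.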
So my main question is: is there a reason why adding many AddedTokens is slow? Or is this just a path that hasn't been fully optimized yet?
Is using AddedTokens in this way simply wrong? Should we be trying something else?
Thanks!
Stéphan
Hey! It depends on which API you are using!
If you are using transformers, it was kind of expected, as handling the addition of special and non-special tokens was hard.
If you are using pure tokenizers, one thing is that we have to add new regex match cases for each new token.
If you want a better way, I would recommend adding them as regular tokens and making sure you add the merge rules! This means adding the merge paths that fuse these tokens! This can be done automatically.
If that is of interest to you, provide me a reproducer with a model on the hub and I can help!
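As a rough illustration of what "adding merge rules" could look like for a BPE model, the sketch below builds a naive left-to-right chain of merges that fuses single characters into a new token. This is an assumption about one possible way to do it, not the procedure the library uses; `merge_chain` and `apply_merges` are hypothetical helpers, and in a real BPE vocabulary every intermediate symbol (here "Pa", "Par", "Pari") would also need an entry.

```python
def merge_chain(token):
    """Naive left-to-right merges that build `token` from single characters."""
    merges, left = [], token[0]
    for ch in token[1:]:
        merges.append((left, ch))
        left += ch
    return merges


def apply_merges(word, merges):
    """Apply each merge in rank order across the word (simplified BPE)."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # fuse the pair into one symbol
            else:
                i += 1
    return symbols


merges = merge_chain("Paris")
print(merges)  # [('P', 'a'), ('Pa', 'r'), ('Par', 'i'), ('Pari', 's')]
print(apply_merges("Paris", merges))      # ['Paris']
print(apply_merges("Parisians", merges))  # ['Paris', 'i', 'a', 'n', 's']
```

Note that under this simplified model the merges also fire inside longer words such as "Parisians", so merge rules alone would not restrict the new tokens to whole-word matches.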
Hey @ArthurZucker , thanks for your response!
I'm using the pure tokenizers API. However, I am using a WordPiece tokenizer (actually just the baai/bge-base-en-v1.5 tokenizer, which AFAIK is just the OG BERT tokenizer), not a BPE tokenizer. I see how adding merges to the BPE tokenizer could lead to a good solution though, so that's a cool idea.
My vocabulary is a list with 400k tokens (just the vocabulary of the GloVe vectors). So assuming vocab is a list of 400k strings, this already takes a lot of time:
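from tokenizers import Tokenizer

# vocab: a list of ~400k strings (the GloVe vocabulary), defined elsewhere
tok = Tokenizer.from_pretrained("baai/bge-base-en-v1.5")
tok.add_tokens(vocab)  # this already takes a lot of time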
This wouldn't really matter to me, but this cost is incurred every time the tokenizer is loaded from disk, which makes the cost of using it prohibitive. I could maybe convert it to BPE, but I'm not sure if that makes sense.
I'll upload the resulting tokenizer once it's done, and post another comment.
Thanks!