huggingface/tokenizers

Tokenizer training killed

AmitChaulwar opened this issue · 9 comments

I was trying to generate a wordpiece vocab for multiple languages from a corpus of ~60 GB using the tokenizers train function.
(The training call was shown in an attached screenshot, Screenshot 2020-11-04 at 08 38 12.)

The training procedure ran for a day and then it got killed. I assume this was because of a memory problem. Can you confirm this, and is there any way to train the tokenizer on a big corpus? I actually want to generate a vocab for an even larger corpus later.

Also, is it possible to somehow use a GPU for training the tokenizer? Right now, it is using the CPU only.

Hi @AmitChaulwar.
This indeed looks like a memory error. You can probably confirm by checking dmesg output.

There are definitely ways to train models on a large corpus; however, it will inevitably require a large amount of memory, as a full representation of the data has to be held in memory. But there are ways to reduce this:

  • (Most likely the problem here) Use an adapted pre_tokenizer: cutting the neighborhood of words lets the algorithm keep track of fewer neighbors, so the pairs structure is much smaller. That is why the Whitespace pre_tokenizer is almost always used for whitespace languages. For non-whitespace languages this becomes a problem; there are specific pre_tokenizers for some of them (jieba for instance). If you use a mix of both types of languages, the UnicodeScripts pre_tokenizer could help too (it forces mixed types of unicode characters to belong to separate tokens). Try to run your tokenizer on a smaller piece of the data to make sure that the output vocabulary actually makes sense for your data; it's a good way to check that the pre_tokenizers are doing their job correctly (see the sketch after this list).
  • Also make sure your normalizers are correctly in place: there are many ways to represent the same text in non-latin languages, and that will make your representation explode in size (and also make your tokenizer very poor).
  • Another direction would be to use a byte-level encoder (ByteLevel), as it ignores various chars and starts off with a much smaller base vocabulary. But it has other caveats.
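As a rough illustration of the first two points, here is a minimal sketch of a multilingual WordPiece training setup with an explicit normalizer and pre_tokenizer; the file name and vocab size are placeholders, and the imports assume a recent tokenizers version:

from tokenizers import Tokenizer, normalizers, pre_tokenizers, trainers
from tokenizers.models import WordPiece

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
# Normalize first so equivalent unicode sequences collapse to a single representation.
tokenizer.normalizer = normalizers.Sequence([normalizers.NFKC()])
# UnicodeScripts splits runs of different scripts; Whitespace handles whitespace languages.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.UnicodeScripts(),
    pre_tokenizers.Whitespace(),
])

trainer = trainers.WordPieceTrainer(
    vocab_size=100_000,  # placeholder value
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
# Start with a small subset of the corpus to sanity-check the resulting vocabulary.
tokenizer.train(["subset_of_corpus.txt"], trainer=trainer)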

Finally, please make sure training your tokenizer on such a large corpus actually makes a difference. It's not necessarily the case, depending on the regularity of your data. Filtering out bogus character runs like ========== (either beforehand or with a normalizer, as sketched below) will definitely reduce the required amount of memory and will probably lead to a better tokenizer in the end.
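For the filtering-with-a-normalizer option, a minimal sketch; the regex here is just an example pattern for runs of '=', to be adapted to whatever junk appears in your corpus:

from tokenizers import Regex, normalizers

# Replace runs of '=' with a single space before any other normalization.
cleanup = normalizers.Sequence([
    normalizers.Replace(Regex(r"=+"), " "),
    normalizers.NFKC(),
])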

Edit: No, there is no way to use a GPU at the moment, and we will probably not implement that in the foreseeable future, but tokenizers does parallelize computation wherever possible.

Thank you for the detailed answer. I will try your recommendations.

Even with your suggestions, the process gets killed. I believe the corpus is too large. One hack I could think of is generating an individual vocab for each language and then merging them. I know it has multiple cons, but it is the best solution I have right now, and it works fairly well with a wordpiece vocab.

The question now is how I can merge ByteLevelBytePairEncoding vocabs of different languages, as each has two files, merges.txt and vocab.json. Can you suggest some way?

You probably should focus on merges.txt, as vocab.json can (mostly) be recreated from it.

Just take every pair from every file and add them back iteratively if the resulting token does not yet exist.
Something like:

# pairs is a list of all (a, b, score/merge_order) tuples collected from every merges.txt

pairs.sort(key=lambda x: x[2])  # Sort by increasing merge order globally

tokens = set()  # Seed with the initial alphabet; ByteLevel.alphabet() should provide it
final_pairs = []
for a, b, _ in pairs:
    new_token = "".join([a, b])
    if new_token not in tokens:
        tokens.add(new_token)
        final_pairs.append((a, b))
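To actually use the merged rules, they would then be written back out in the merges.txt format; a minimal sketch, assuming the usual one-pair-per-line layout with a version header:

with open("merges_merged.txt", "w", encoding="utf-8") as f:
    f.write("#version: 0.2\n")  # header line found in typical merges.txt files
    for a, b in final_pairs:
        f.write(f"{a} {b}\n")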

Ok, I got it. Is there any script already available to generate vocab.json using merges.txt?
I will have to write it otherwise.

No, there's none already written.
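For reference, a minimal sketch of what such a script could look like, assuming final_pairs from the merging step above, the byte-level initial alphabet, and that any special tokens are added separately:

import json

from tokenizers import pre_tokenizers

vocab = {}
# Base byte-level alphabet first, then one new token per merge rule, in merge order.
for token in pre_tokenizers.ByteLevel.alphabet():
    vocab.setdefault(token, len(vocab))
for a, b in final_pairs:
    vocab.setdefault(a + b, len(vocab))

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)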

And again, please try training a tokenizer on a small subset of your data. It should really be your first step, to see how it behaves and to check that the resulting vocabulary is coherent.

Thanks for the info. I have already tested it on a small subset of data. The wordpiece vocabulary merging worked fairly well on a large corpus (I got better results using that vocabulary when fine-tuning BERT for a specific task). Hopefully, merging the BLBPE vocabs works even better.

Okay, please let us know how it works. If it starts to become a common use case, we'll probably consider adding this to the core lib. Memory issues seem to pop up quite regularly; our first focus will be to reduce our footprint, but there's only so much we can do, and splitting training across chunks of data is another option.

Well, I am finally training the BLBPE vocab with more RAM. It was unnecessarily complicated to combine the vocabs for BLBPE, although it can be done if one really wants to. Still, it does not look like a bad idea when memory problems occur. If you are planning to add this to the core lib, make sure that you generate a balanced vocabulary.

For the wordpiece vocabulary, I pre-calculated what the vocabulary size for each language should be given the total vocabulary size, and then merged the per-language text vocabulary files, removing duplicates. This helps keep it balanced. However, the total vocabulary size shrinks and you do not have control over it. A feature for generating a balanced multilingual vocabulary would be nice.
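A minimal sketch of the merge step described here, assuming one plain-text WordPiece vocab file per language (one token per line) with the per-language sizes decided beforehand; the file names are hypothetical:

seen = set()
merged = []
for path in ["vocab_lang1.txt", "vocab_lang2.txt", "vocab_lang3.txt"]:  # hypothetical files
    with open(path, encoding="utf-8") as f:
        for line in f:
            token = line.rstrip("\n")
            if token and token not in seen:  # drop tokens shared across languages
                seen.add(token)
                merged.append(token)

with open("vocab_merged.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(merged) + "\n")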