out of memory when training a BBPE tokenizer on a large corpus
Opened this issue · 4 comments
Hi there, this may be a stupid question, but I'm confused...
I compiled a corpus of 20GB of plain raw text and wanted to train my own customized BBPE tokenizer.
Following your NLP course (https://huggingface.co/learn/nlp-course/chapter2/4, which is friendly and easy to understand BTW!), I used the same code, and it seemed fine at first, but it slowed down and soon ran out of memory during the merge step.
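For reference, here is a minimal sketch of the byte-level BPE training setup I used, based on the course. The file paths, batch size, vocab size, and special tokens below are placeholders, not my exact configuration; the corpus is streamed in batches so the raw text itself never has to be fully loaded into memory.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Placeholder paths; my real corpus is ~20GB of plain text split across files.
corpus_files = ["corpus/part-000.txt", "corpus/part-001.txt"]

def get_training_corpus(batch_size=1000):
    # Yield the corpus in small batches of lines so the raw text
    # is read lazily instead of being loaded all at once.
    for path in corpus_files:
        with open(path, encoding="utf-8") as f:
            batch = []
            for line in f:
                batch.append(line)
                if len(batch) == batch_size:
                    yield batch
                    batch = []
            if batch:
                yield batch

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50_000,                      # placeholder vocab size
    special_tokens=["<|endoftext|>"],       # placeholder special tokens
)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
tokenizer.save("bbpe-tokenizer.json")
```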
The program failed on servers with 1.5 TB or 2 TB of memory.
From some tutorials, I understand that BPE training has to keep heavy pair statistics in memory. But I guess top players like OAI (with closed models) or Hugging Face (with open-source models) must have better solutions, such as multi-node training, for training their own tokenizers on very large corpora (say 500GB of text or much more).
Would anyone kindly show me how to do this at an industrial scale?
Hey! I think this is related to #1539 and we are fixing it! 🤗
We will be doing a release that includes this in a few days!
@yucc-leon Can you confirm the issue was solved?
I'll test it in a few days and report the result here!