Training a model from in-memory data

Question

loicbarrault opened this issue 5 years ago · 1 comment

Hi,
How can I change the code to train a model from in-memory data instead of from files?
Basically, I would like to replace
tokenizer.train(["wiki.test.raw"], vocab_size=20000)
with
tokenizer.train(data_array, vocab_size=20000)
where data_array is, e.g., an array of sentences such as ["First sentence", "second sentence", ...].
Thanks for your work!
Best, Loic

Answer 1 · 2022-07-22T18:44:01.000Z

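See https://huggingface.co/docs/tokenizers/training_from_memory for an example. In recent versions of the library, training from in-memory data goes through train_from_iterator, which accepts any Python iterator of strings, so a plain list of sentences works.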
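
For reference, here is a minimal sketch of what that could look like for the data_array from the question. It assumes a recent release of the tokenizers package in which train_from_iterator is available. The BPE model, the Whitespace pre-tokenizer, the "[UNK]" token and the third sample sentence are illustrative choices, not something the question specified.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# In-memory training data: any iterable of strings works (list, generator, ...).
# The third sentence is made up here just to give the trainer more text.
data_array = [
    "First sentence",
    "second sentence",
    "a third, slightly longer sentence",
]

# Assumption: a BPE model with whitespace pre-tokenization; swap in
# another model/trainer pair (WordPiece, Unigram, ...) as needed.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# vocab_size is an upper bound; a tiny corpus yields a much smaller vocabulary.
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    special_tokens=["[UNK]"],
    show_progress=False,
)

# Train directly from the in-memory list instead of from files.
tokenizer.train_from_iterator(data_array, trainer=trainer)

encoding = tokenizer.encode("First sentence")
print(encoding.tokens)
```

For data that does not fit in a list, the same call should accept a generator, optionally one that yields batches (lists of strings), so the corpus never has to be written to disk.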