huggingface/tokenizers

Training a model from in-memory data

loicbarrault opened this issue · 1 comment

Hi,
How could I change the code so that it is possible to train a model from in-memory data instead of from files?
Basically, replacing
tokenizer.train(["wiki.test.raw"], vocab_size=20000)
with
tokenizer.train(data_array, vocab_size=20000)
where data_array is e.g. an array of sentences such as ["First sentence", "second sentence", ...].
Thanks for your work!
Best, Loic
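
In recent versions of the tokenizers library, implementation classes such as ByteLevelBPETokenizer expose a train_from_iterator method that accepts any iterable of strings, so an in-memory list of sentences can be used directly instead of file paths. A minimal sketch (the sample sentences and vocab_size value are placeholders):

```python
from tokenizers import ByteLevelBPETokenizer

# In-memory training data: any iterable of strings works,
# e.g. a list, a generator, or a column from a dataset.
data_array = ["First sentence", "second sentence", "a third sentence"]

tokenizer = ByteLevelBPETokenizer()

# Train directly from the iterator instead of from files.
tokenizer.train_from_iterator(data_array, vocab_size=500)

# The trained tokenizer can then encode new text as usual.
encoding = tokenizer.encode("First sentence")
print(encoding.tokens)
```

The same pattern applies to the other implementation tokenizers (e.g. BertWordPieceTokenizer); for a bare Tokenizer object, train_from_iterator takes a trainer argument instead of keyword options like vocab_size.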