Training a model from in-memory data
loicbarrault opened this issue · 1 comment
loicbarrault commented
Hi,
How could I change the code so that it is possible to train a model from in-memory data instead of using files?
Basically, I would like to replace
tokenizer.train(["wiki.test.raw"], vocab_size=20000)
with
tokenizer.train(data_array, vocab_size=20000)
where data_array is, for example, a list of sentences such as ["First sentence", "second sentence", ...].
Thanks for your work!
Best, Loic
Luckick commented
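See https://huggingface.co/docs/tokenizers/training_from_memory for an example. It trains with train_from_iterator, which accepts any Python iterator of strings (or of lists of strings) instead of file paths.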
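
For the data_array case in the question, a minimal sketch could look like the following. It assumes a BPE model with whitespace pre-tokenization, since the issue does not say which tokenizer class is in use; batch_iterator and train_from_memory are hypothetical helper names, not part of the library.

```python
from typing import Iterator, List, Sequence


def batch_iterator(data: Sequence[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield successive batches of sentences from an in-memory sequence."""
    for start in range(0, len(data), batch_size):
        yield list(data[start:start + batch_size])


def train_from_memory(data: Sequence[str], vocab_size: int = 20000):
    """Train a tokenizer on in-memory sentences (hypothetical helper)."""
    # Imported lazily so batch_iterator stays usable without the library.
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    # Assumption: BPE with whitespace pre-tokenization; swap in the model
    # and pre-tokenizer that match your own setup.
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])

    # train_from_iterator accepts strings or lists of strings; length is
    # optional and only used for progress reporting.
    tokenizer.train_from_iterator(
        batch_iterator(data), trainer=trainer, length=len(data)
    )
    return tokenizer
```

Calling train_from_memory(data_array, vocab_size=20000) would then play the role of the file-based train call. Batching is optional here: passing the list directly to train_from_iterator should also work, and the generator mainly matters when the data is produced lazily.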