Dynamic tokenization during batch generation
howlinghuffy opened this issue · 1 comment
Is there currently a way to tokenize sentences in real time during batch generation, rather than as a dataset loading step?
I would like to implement a SentencePiece subword function (similar to BPE) that tokenizes a sentence in a slightly different way each time (this has been shown to improve translation quality for subword translation).
A single sentence, for example, may be randomly tokenized in 100 different ways.
I would like to do this tokenization in real time, so that I can just store the raw sentences in memory, rather than all 100 possible tokenization variations for each source and target sentence. Is this possible with the current keras_wrapper setup?
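For reference, here is a minimal sketch of the kind of sampled tokenization I have in mind, assuming a trained SentencePiece model (the model path, sentence, and sampling parameters below are just placeholders):

```python
import sentencepiece as spm

# Load a trained SentencePiece model (placeholder path).
sp = spm.SentencePieceProcessor()
sp.Load('subword.model')

sentence = 'this is a test sentence for subword sampling'

# Each call can return a different segmentation of the same sentence
# (subword regularization); nbest_size=-1 samples from all hypotheses
# and alpha controls how peaked the sampling distribution is.
for _ in range(3):
    print(sp.SampleEncodeAsPieces(sentence, nbest_size=-1, alpha=0.1))
```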
Hi @howlinghuffy,
To do this, you would need to modify the Dataset class a little. As it is now, we first apply a preprocessing function to all the text, tokenizing it if required. Next, we load the text for each batch with the loadText method. So you should modify this step to perform the dynamic tokenization. More precisely, I think it should be done around these lines: probably, instead of the tokenization currently done at
multimodal_keras_wrapper/keras_wrapper/dataset.py, line 2114 (commit b1e588a),
it should call your dynamic_tokenizing_function.
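A rough sketch of what I mean, assuming loadText receives the raw sentences of the batch as its first argument (the exact signature may differ, so treat the names here as placeholders rather than the real API):

```python
from keras_wrapper.dataset import Dataset

class DynamicTokenizationDataset(Dataset):
    """Sketch: re-tokenize the raw sentences every time a batch is loaded."""

    def __init__(self, dynamic_tokenizing_function, *args, **kwargs):
        super(DynamicTokenizationDataset, self).__init__(*args, **kwargs)
        self.dynamic_tokenizing_function = dynamic_tokenizing_function

    def loadText(self, X, *args, **kwargs):
        # Instead of using text tokenized once at dataset-loading time,
        # apply the (randomized) tokenizer to the raw sentences of this batch.
        X = [self.dynamic_tokenizing_function(sentence) for sentence in X]
        return super(DynamicTokenizationDataset, self).loadText(X, *args, **kwargs)
```

Here, dynamic_tokenizing_function would wrap something like sp.SampleEncodeAsPieces(sentence, nbest_size=-1, alpha=0.1) and join the sampled pieces back into a string, so you only keep the raw sentences in memory and get a fresh segmentation each time a batch is built.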