MarcBS/multimodal_keras_wrapper

Dynamic tokenization during batch generation

howlinghuffy opened this issue · 1 comments

Is there currently a way to tokenize sentences in real time during batch generation, rather than as a dataset loading step?

I would like to implement a sentencepiece subword tokenization function (similar to BPE) that tokenizes a sentence slightly differently each time it is sampled (this has been shown to improve translation quality for subword-based translation).

A single sentence, for example, may be randomly tokenized in 100 different ways.

I would like to do this tokenization in real time, so that I can just store the raw sentences in memory, rather than all 100 possible tokenization variations for each source and target sentence. Is this possible with the current keras_wrapper setup?
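
For reference, the kind of sampled segmentation I mean uses sentencepiece's subword regularization. A minimal sketch (the model path and sampling parameters are just placeholders, not part of keras_wrapper):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('spm.model')  # any trained sentencepiece model (placeholder path)

sentence = 'I would like to translate this sentence.'

# With sampling enabled, each call can return a different segmentation.
# nbest_size=-1 samples from all hypotheses; alpha controls the smoothing.
for _ in range(3):
    print(sp.SampleEncodeAsPieces(sentence, -1, 0.1))
```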

Hi @howlinghuffy ,

to do this, you would need to modify the Dataset class slightly. As it is now, we first apply a preprocessing function to all the text, tokenizing it if required. Then, we load the text for each batch with the loadText method. That is the method you should modify to get dynamic tokenization. More precisely, I think it should be done around these lines; instead of doing

x = X[i].strip().split(' ')

it should call your dynamic_tokenizing_function.
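
As a rough sketch of that change (the sentencepiece model path, the sampling parameters, and the way the processor is attached are assumptions for illustration, not existing keras_wrapper API):

```python
import sentencepiece as spm

# Load a trained sentencepiece model once, e.g. when the Dataset is built
# (placeholder path).
sp = spm.SentencePieceProcessor()
sp.Load('spm.model')

def dynamic_tokenizing_function(sentence, nbest_size=-1, alpha=0.1):
    """Return a sampled subword segmentation of a raw sentence."""
    return sp.SampleEncodeAsPieces(sentence.strip(), nbest_size, alpha)

# ...and then, inside loadText, replace the whitespace split with:
x = dynamic_tokenizing_function(X[i])
```

Since the raw sentences stay in memory and the segmentation is drawn on every batch, each epoch would see a different tokenization of the same sentence.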