Preprocessing issue in mydatasets.py

Question

rriva002 opened this issue 4 years ago · 1 comment

I was reading the documentation for the torchtext Field object and noticed that preprocessing is applied after tokenization. This conflicts with the intent of the clean_str function: when it is set as the text field's preprocessing pipeline, it runs on each token individually rather than on the whole sentence, so rules such as splitting contractions produce tokens that contain spaces. To fix this, the following statement on line 74:

text_field.preprocessing = data.Pipeline(clean_str)

can be replaced with something like this:

text_field.tokenize = lambda x: clean_str(x).split()

which will apply clean_str before tokenization (str.split() is the default tokenizer used by the Field object).
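
To illustrate the difference without depending on torchtext, here is a minimal sketch. The clean_str below is a simplified stand-in that I am assuming for the example (the real function in mydatasets.py has more rules), and the two list expressions only mimic the order of operations described above. They are not the Field implementation.

```python
import re


def clean_str(string):
    # Simplified stand-in (assumption): the real clean_str has more rules.
    # It separates "n't" and "!" from the surrounding text with spaces.
    string = re.sub(r"n't", " n't", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip()


sentence = "I don't know!"

# Mimics preprocessing after tokenization: split first, then clean each token.
per_token = [clean_str(tok) for tok in sentence.split()]

# Mimics the proposed fix: clean the whole sentence, then split.
whole_sentence = clean_str(sentence).split()

print(per_token)       # ['I', "do n't", 'know !']
print(whole_sentence)  # ['I', 'do', "n't", 'know', '!']
```

In the first result, "do n't" and "know !" are single tokens with a space inside them, which is the problem described above. In the second, every token is free of spaces.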

Answer 1 · 2020-08-30T14:10:08.000Z

Thank you for your suggestion. Could you submit a pull request to fix this problem?