Preprocessing custom dataset without removing punctuation
ninavdPipple opened this issue · 1 comment
ninavdPipple commented
Hi,
I'm trying to load a custom dataset without removing the punctuation. However, even when I set remove_punctuation = False, all punctuation is still removed. Worse, any word attached to a punctuation mark is dropped as well: for example, 'Good evening!' becomes just 'Good' in the corpus. How can I fix this? Ideally I want to remove all punctuation except '<' and '>', but I cannot find any configuration that keeps any punctuation at all.
Thanks in advance!
Nina
ninavdPipple commented
I figured out that this happens because the preprocessing step builds a vocabulary from which all punctuation is automatically removed. Ignoring that vocabulary avoids the problem.
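For anyone hitting the same symptom, here is a minimal standard-library sketch of one way a vocabulary step could produce it, along with a pre-cleaning helper that keeps only '<' and '>'. It is not the library's actual code: `TOKEN_PATTERN`, `build_vocabulary`, `filter_by_vocabulary`, and `strip_punctuation_except` are hypothetical names, and the assumption is that the vocabulary is built with a word-characters-only pattern while documents are later filtered token by token on whitespace.

```python
import re
import string

# Assumption: the vocabulary is built with a pattern that matches only
# runs of two or more word characters, so punctuation never enters it.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def build_vocabulary(docs):
    """Collect every token matched by TOKEN_PATTERN across all documents."""
    vocab = set()
    for doc in docs:
        vocab.update(TOKEN_PATTERN.findall(doc))
    return vocab


def filter_by_vocabulary(doc, vocab):
    """Keep only whitespace-separated tokens that appear in the vocabulary.

    A token such as 'evening!' is not in the vocabulary (only 'evening' is),
    so the whole word is dropped, not just the punctuation mark.
    """
    return " ".join(word for word in doc.split() if word in vocab)


def strip_punctuation_except(text, keep="<>"):
    """Remove all ASCII punctuation except the characters listed in `keep`."""
    drop = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", drop))


docs = ["Good evening!"]
vocab = build_vocabulary(docs)
print(filter_by_vocabulary(docs[0], vocab))              # Good
print(strip_punctuation_except("Good evening! <eos>"))   # Good evening <eos>
```

If the real pipeline behaves like this sketch, cleaning the text with something like `strip_punctuation_except` before loading it, and then skipping the vocabulary filter, would keep '<' and '>' while removing everything else.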