Preprocessing custom dataset without removing punctuation

Question

ninavdPipple opened this issue a year ago · 1 comment

Hi,
I'm trying to load a custom dataset without removing the punctuation. However, even with remove_punctuation = False, all punctuation is still removed and, even worse, any word attached to punctuation is dropped as well. For example, 'Good evening!' simply becomes 'Good' in the corpus. How can I fix this? Ideally I want to remove all punctuation except '<' and '>', but I cannot find any configuration that leaves any punctuation in place.
Thanks in advance!
Nina
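
To make the target behaviour concrete, here is a minimal sketch in plain Python of "remove all punctuation except '<' and '>'". It is independent of the library's preprocessing; the function name is hypothetical, and it assumes "punctuation" means the ASCII characters in string.punctuation.

```python
import string

# Characters to keep (from the question); everything else in
# string.punctuation is deleted. ASCII-only is an assumption of this sketch.
KEEP = "<>"
DROP = "".join(ch for ch in string.punctuation if ch not in KEEP)
_TABLE = str.maketrans("", "", DROP)


def strip_punctuation_except_angle_brackets(text: str) -> str:
    """Delete punctuation characters but leave '<' and '>' in place."""
    return text.translate(_TABLE)


print(strip_punctuation_except_angle_brackets("Good evening! <name> said hi."))
# Good evening <name> said hi
```

Running something like this on the raw text before it is handed to the library only helps if the later preprocessing steps do not strip the remaining characters again.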

Answer 1 · 2024-01-05T07:49:08.000Z

I figured out that this is caused by the vocabulary that is built during preprocessing: when the vocabulary is created, all punctuation is stripped automatically. Bypassing the vocabulary avoids the problem.
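
A minimal sketch of how a vocabulary step could produce exactly the symptom from the question. This is not the library's actual code: the function names and the word-only regex are assumptions chosen for illustration. The idea is that the vocabulary only contains word characters, so a token like 'evening!' never matches an entry and is dropped entirely.

```python
import re


def build_vocabulary(docs):
    """Collect word-only tokens, as a word-pattern vocabulary would (assumed)."""
    vocab = set()
    for doc in docs:
        vocab.update(re.findall(r"\b\w+\b", doc))
    return vocab


def filter_by_vocabulary(doc, vocab):
    """Keep only whitespace-separated tokens that appear in the vocabulary."""
    return " ".join(tok for tok in doc.split() if tok in vocab)


def keep_all_tokens(doc):
    """Skip the vocabulary filter: tokens keep their punctuation."""
    return " ".join(doc.split())


docs = ["Good evening!"]
vocab = build_vocabulary(docs)  # {'Good', 'evening'}

print(filter_by_vocabulary(docs[0], vocab))  # Good
print(keep_all_tokens(docs[0]))              # Good evening!
```

Here 'evening!' is missing from the vocabulary because only 'evening' was stored, which is why the whole token disappears instead of just the exclamation mark. Skipping the filter leaves the text untouched.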