word_tokenize() Failed to Split English Contractions When Followed by [\t\n\f\r]
donglihe-hub opened this issue · 2 comments
donglihe-hub commented
Hi Maintainers,
I found that nltk.word_tokenize() fails to split contractions such as "he's" and "book's" when they are followed by one of the whitespace characters [\t\n\f\r]. The examples below reproduce the issue.
How to Reproduce
sentence_1 = "he's a good boy."
word_tokenize(sentence_1)
# ['he', "'s", 'a', 'good', 'boy', '.']
sentence_2 = "he's\t a good boy."
word_tokenize(sentence_2)
# ["he's", 'a', 'good', 'boy', '.']
"he's" in sentence_2 is not split because it is followed by "\t" rather than a regular space. The same problem occurs with the other whitespace characters [\n\f\r].
Here is another, seemingly odd, example: when the contraction is the last word of the string, it is split correctly even when a whitespace character such as "\t" follows it.
sentence_3 = "he's\f a good boy. he's\t"
word_tokenize(sentence_3)
# ["he's", 'a', 'good', 'boy', '.', 'he', "'s"]
Expected Behaviors
Contractions should be split correctly regardless of which whitespace character (if any) follows them.
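Until this is fixed in the tokenizer itself, one possible workaround is to normalize the affected whitespace characters to regular spaces before tokenizing. A minimal sketch using only the standard library (normalize_ws is a hypothetical helper, not part of NLTK):

```python
import re

def normalize_ws(text):
    # Hypothetical helper: replace tab, newline, form feed, and carriage
    # return with a regular space so contractions are followed by the
    # whitespace character the tokenizer handles correctly.
    return re.sub(r"[\t\n\f\r]", " ", text)

sentence_2 = "he's\t a good boy."
print(normalize_ws(sentence_2))
# "he's  a good boy."
```

Passing the normalized string to word_tokenize should then split the contraction as in sentence_1, i.e. ['he', "'s", 'a', 'good', 'boy', '.'], though this sidesteps the underlying bug rather than fixing it.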
Environments
Python: 3.7.12 and 3.10.12
nltk (installed via pip): 3.8.1