
word_tokenize() Fails to Split English Contractions When Followed by [\t\n\f\r]


Hi Maintainers,

I found that nltk.word_tokenize() fails to split contractions such as "he's" or "book's" when they are followed by one of the whitespace characters [\t\n\f\r]. The examples below demonstrate the issue.

How to Reproduce

from nltk.tokenize import word_tokenize

sentence_1 = "he's a good boy."
word_tokenize(sentence_1)
# ['he', "'s", 'a', 'good', 'boy', '.']

sentence_2 = "he's\t a good boy."
word_tokenize(sentence_2)
# ["he's", 'a', 'good', 'boy', '.']

"he's" in sentence_2 is not split because it is followed by "\t" rather than a white space. This issue also applies to other whitespace characters like [\n\f\r].

Here is another, somewhat counterintuitive, example: when the contraction is the last word of the input, it is split correctly whether or not a whitespace character follows it.

sentence_3 = "he's\f a good boy. he's\t"
word_tokenize(sentence_3)
# ["he's", 'a', 'good', 'boy', '.', 'he', "'s"]

Expected Behavior

Contractions should be split correctly regardless of which whitespace character, if any, follows them.
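
If the cause is indeed the trailing literal space in those patterns, one possible direction (just a sketch, untested against NLTK's test suite) would be to accept any whitespace via a lookahead instead:

import re

# Hypothetical variant: (?=\s) accepts \t, \n, \f, and \r as well as " ",
# without consuming the character.
fixed = re.compile(r"([^' ])('[sS]|'[mM]|'[dD]|')(?=\s)")

fixed.sub(r"\1 \2 ", "he's\t a good boy.")
# "he 's \t a good boy." -- the tokenizer's later whitespace split would
# then yield the expected tokens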

Environment

Python: 3.7.12 and 3.10.12
nltk (installed via pip): 3.8.1