wrong token
CnsAd opened this issue · 3 comments
CnsAd commented
INPUT:
import spacy
nlp = spacy.load('en')
doc1 = nlp(u"They have killed the bat last night. We were so scared!")
for token in doc1:
    print(token)
OUTPUT:
They
have
killed
the
bat
last
night
.
We
we
re
so
scared
The word "were" has been tokenized incorrectly: it is split into "we" and "re".
keotic commented
I noticed that as well. So far I have only spotted this bug with "were".
ines commented
Ah, this seems to be a mistake in the tokenizer exceptions. It's adding all contractions with and without apostrophes, but "were" and "Were" should obviously have been excluded (like it's currently done for "well", "hell", "ill" etc.).
This is easy to fix – will do this now and add a regression test.
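To illustrate the mechanism described above, here is a minimal, self-contained sketch in plain Python. It is not spaCy's actual source: the names CONTRACTIONS, build_exceptions and tokenize, the small contraction table, and the whitespace-based toy tokenizer are all assumptions made for illustration. It only shows how generating apostrophe-free contraction forms without excluding "were" reproduces the split reported in this issue, and how adding it to the exclusion set would avoid it.

```python
# Hypothetical sketch, not spaCy's real implementation.
# Pronouns mapped to the contraction suffixes they can take (illustrative subset).
CONTRACTIONS = {
    "we": ["'re", "'ll"],
    "they": ["'re", "'ll"],
    "he": ["'ll"],
    "i": ["'ll"],
}

# Apostrophe-free forms that are also ordinary words and must not be split.
OLD_EXCLUDE = {"well", "Well", "hell", "Hell", "ill", "Ill"}
NEW_EXCLUDE = OLD_EXCLUDE | {"were", "Were"}


def build_exceptions(exclude):
    """Map each contraction (with and without apostrophe) to its token pieces."""
    exceptions = {}
    for pronoun, suffixes in CONTRACTIONS.items():
        for suffix in suffixes:
            bare = suffix.lstrip("'")
            for head in (pronoun, pronoun.title()):
                # Form with apostrophe, e.g. "we're" -> ["we", "'re"]
                exceptions[head + suffix] = [head, suffix]
                # Form without apostrophe, e.g. "theyre" -> ["they", "re"],
                # skipped when it collides with a real word.
                plain = head + bare
                if plain not in exclude:
                    exceptions[plain] = [head, bare]
    return exceptions


def tokenize(text, exceptions):
    """Toy tokenizer: split on whitespace, then apply the exception table."""
    tokens = []
    for word in text.split():
        tokens.extend(exceptions.get(word, [word]))
    return tokens


buggy = build_exceptions(OLD_EXCLUDE)
fixed = build_exceptions(NEW_EXCLUDE)

print(tokenize("We were so scared", buggy))
print(tokenize("We were so scared", fixed))
```

A regression test in this style would assert that "were" and "Were" each come back as a single token while genuine contractions such as "We're" are still split.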
lock commented
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.