explosion/spaCy

wrong token

CnsAd opened this issue · 3 comments

CnsAd commented

INPUT:

import spacy
nlp = spacy.load('en')
doc1 = nlp(u"They have killed the bat last night. We were so scared!")
for token in doc1:
    print(token)

OUTPUT:
They
have
killed
the
bat
last
night
.
We
we
re
so
scared

The word "were" has been tokenized incorrectly: it is split into "we" and "re"!

I noticed that as well. So far I have only spotted this bug with "were".

ines commented

Ah, this seems to be a mistake in the tokenizer exceptions. It adds all contractions both with and without apostrophes, but "were" and "Were" should obviously have been excluded (as is currently done for "well", "hell", "ill" etc.).

This is easy to fix – will do this now and add a regression test.
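
To make the mechanism concrete, here is a minimal, self-contained sketch of how generating apostrophe-less contraction exceptions can swallow a real word, and how an exclusion set avoids it. This is not spaCy's actual tokenizer exception code: the names `CONTRACTIONS`, `AMBIGUOUS`, `build_exceptions` and `tokenize` are hypothetical, and the whitespace splitting is a deliberate simplification.

```python
# Illustrative sketch only; NOT spaCy's real implementation.
# All names below are hypothetical and chosen for this example.

# (stem, suffix) pairs that form contractions such as "we're" or "he'll".
CONTRACTIONS = [
    ("we", "re"), ("we", "ll"), ("he", "ll"),
    ("i", "ll"), ("you", "re"), ("they", "re"),
]

# Apostrophe-less forms that collide with ordinary English words.
AMBIGUOUS = {"were", "well", "hell", "ill"}


def build_exceptions(exclude=frozenset()):
    """Map each contraction, with and without apostrophe, to its pieces.

    Apostrophe-less forms listed in `exclude` are skipped.
    """
    exceptions = {}
    for stem, suffix in CONTRACTIONS:
        exceptions[stem + "'" + suffix] = [stem, "'" + suffix]
        bare = stem + suffix
        if bare not in exclude:
            exceptions[bare] = [stem, suffix]
    return exceptions


def tokenize(text, exceptions):
    """Split on whitespace, then apply exceptions case-insensitively."""
    tokens = []
    for word in text.split():
        pieces = exceptions.get(word.lower())
        if pieces is not None:
            tokens.extend(pieces)
        else:
            tokens.append(word)
    return tokens


buggy = build_exceptions()
fixed = build_exceptions(exclude=AMBIGUOUS)

print(tokenize("We were so scared", buggy))  # ['We', 'we', 're', 'so', 'scared']
print(tokenize("We were so scared", fixed))  # ['We', 'were', 'so', 'scared']
```

With the exclusion set applied, genuine contractions such as "we're" are still split, while "were" is left intact. A regression test along these lines would assert that "were" comes back as a single token.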

lock commented

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.