wrong token

Question

CnsAd opened this issue 8 years ago · 3 comments

INPUT:

import spacy
nlp = spacy.load('en')
doc1 = nlp(u"They have killed the bat last night. We were so scared!")
for token in doc1:
    print(token)

OUTPUT:
They
have
killed
the
bat
last
night
.
We
we
re
so
scared

The word "were" has been tokenized incorrectly: it was split into "we" and "re"!

Answer 1 · 2017-01-16T10:01:55.000Z

I noticed that as well. So far I have only spotted this bug with "were".

Answer 2 · 2017-01-16T11:54:56.000Z

Ah, this seems to be a mistake in the tokenizer exceptions. They add all contractions both with and without apostrophes, but "were" and "Were" should obviously have been excluded (as is currently done for "well", "hell", "ill", etc.).

This is easy to fix – will do this now and add a regression test.
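
For anyone curious how this kind of bug arises, here is a rough, self-contained sketch of the mechanism described above. It is not spaCy's actual exception table or tokenizer: the `CONTRACTIONS` table and the `build_exceptions` and `tokenize` helpers are hypothetical names made up for illustration. The idea is only that generating every contraction without its apostrophe produces strings that collide with ordinary words, unless those strings are explicitly excluded.

```python
# Hypothetical, simplified model of generated tokenizer exceptions.
# This is NOT spaCy's real data or API; all names here are illustrative.
CONTRACTIONS = {
    "we": ["ll", "re"],
    "he": ["ll"],
    "i": ["ll"],
    "they": ["ll", "re"],
}


def build_exceptions(exclude=frozenset()):
    """Map each contraction (with and without apostrophe) to its pieces."""
    exceptions = {}
    for pronoun, suffixes in CONTRACTIONS.items():
        for suffix in suffixes:
            # Generate lowercase and capitalized variants, e.g. "we" and "We".
            for head in (pronoun, pronoun.capitalize()):
                # Form with apostrophe: "we're" -> ["we", "'re"]
                exceptions[head + "'" + suffix] = [head, "'" + suffix]
                # Form without apostrophe: "theyre" -> ["they", "re"],
                # skipped when it collides with a real word such as "were".
                bare = head + suffix
                if bare.lower() not in exclude:
                    exceptions[bare] = [head, suffix]
    return exceptions


def tokenize(text, exceptions):
    """Split on whitespace, then apply the exception table per word."""
    tokens = []
    for word in text.split():
        tokens.extend(exceptions.get(word, [word]))
    return tokens


buggy = build_exceptions()
fixed = build_exceptions(exclude={"well", "hell", "ill", "were"})

print(tokenize("We were so scared", buggy))  # ['We', 'we', 're', 'so', 'scared']
print(tokenize("We were so scared", fixed))  # ['We', 'were', 'so', 'scared']
```

With the exclusion in place, genuine apostrophe-less contractions such as "theyre" are still split, while "were" and "Were" are left alone. A regression test would presumably assert something similar against the real tokenizer.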

Answer 3 · 2018-05-09T04:38:45.000Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.