Fix stopwords bug
The Polish stop words we merged into spaCy are not working.
DoD:
- bug solved on new branch in our spacy repo
- a test for stop words is written
- the tests pass (results of the passing tests published in the issue)
How to reproduce the bug:
import spacy
cls = spacy.util.get_lang_class('pl')
nlp = cls()
nlp('w')[0].is_stop  # does not work for any stop word
The analogous code for Greek works:
cls = spacy.util.get_lang_class('el')
nlp = cls()
nlp('αδιάκοπα')[0].is_stop
The demo branch runs this code without problems (with the hotfix suggested by @MateuszOlko when the initial lemmatizer version was completed):
import spacy
from spacy.lang.pl import Polish, PolishTagger # hotfix for getting lemmatizer to work
nlp = Polish()
tagger = PolishTagger(nlp.vocab) # hotfix for getting lemmatizer to work
nlp.add_pipe(tagger, first=True, name='polish_tagger')
pan_tadeusz = """
Litwo! Ojczyzno moja! ty jesteś jak zdrowie:
Ile cię trzeba cenić, ten tylko się dowie,
Kto cię stracił. Dziś piękność twą w całej ozdobie
Widzę i opisuję, bo tęsknię po tobie.
"""
doc = nlp(pan_tadeusz)
for token in doc[:10]:
    print(f"Token: {token} \n - alpha: {token.is_alpha}, \n - digit: {token.is_digit}, \n - stopword: {token.is_stop}")
Output:
Token:
- alpha: False,
- digit: False,
- stopword: False
Token: Litwo
- alpha: True,
- digit: False,
- stopword: False
Token: !
- alpha: False,
- digit: False,
- stopword: False
Token: Ojczyzno
- alpha: True,
- digit: False,
- stopword: False
Token: moja
- alpha: True,
- digit: False,
- stopword: True
Token: !
- alpha: False,
- digit: False,
- stopword: False
Token: ty
- alpha: True,
- digit: False,
- stopword: True
Token: jesteś
- alpha: True,
- digit: False,
- stopword: False
Token: jak
- alpha: True,
- digit: False,
- stopword: True
Token: zdrowie
- alpha: True,
- digit: False,
- stopword: False
It seems that the Tagger is needed in order for the Lemmatizer (or Tokenizer?) to do any meaningful work on the tokens (even though the dummy tagger added on the demo branch always returns the same thing).
If this is the case, I suggest testing the language by loading one of the newly trained models and checking how well it works. Instead of writing unit tests, we could package the model so that it can finally be loaded via spacy.load, which would make our lives easier. @Gizzio @DoomCoder what do you think?
Added a test on branch bug/stopwords
spacy-pl/spaCy@bfab4b0
Tested on both our build and a fresh install (spaCy 2.1.1, Python 3.6.6); unable to reproduce the bug.
IMPORTANT NOTE
The only problem I've noticed with stop words occurs when the first letter is capitalized.
For example, in a sentence "Z ziemi włoskiej do Polski." the first token "Z" would not be recognized as a stopword. I thought there's nothing we can do about it since there's no way of telling whether that's some sort of abbreviation - well, there is, and it's called tokenizer_exceptions.
In conclusion, for each stopword beginning with a lowercase letter, we should consider adding a copy with the first letter capitalized.
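A minimal sketch of that idea in plain Python, without spaCy. The word list is a small hypothetical sample, not our real stop word list, and the helper names with_capitalized_variants and is_stop are made up for illustration. The lookup is assumed to be a case-sensitive set membership check, which may not match what spaCy does internally.

```python
# Hypothetical sample; the real list lives in the Polish language data.
STOP_WORDS = {"w", "z", "ty", "jak", "moja"}


def with_capitalized_variants(stop_words):
    """Return a new set that also holds a first-letter-capitalized copy
    of every stop word that starts with a lowercase letter."""
    extended = set(stop_words)
    for word in stop_words:
        if word and word[0].islower():
            # Upper-case only the first character, keep the rest unchanged.
            extended.add(word[0].upper() + word[1:])
    return extended


EXTENDED_STOP_WORDS = with_capitalized_variants(STOP_WORDS)


def is_stop(token_text, stop_words=EXTENDED_STOP_WORDS):
    # Assumption: a case-sensitive membership test stands in for the real lookup.
    return token_text in stop_words


tokens = ["Z", "ziemi", "włoskiej", "do", "Polski", "."]
for text in tokens:
    print(text, is_stop(text))
```

With the extended set, "Z" at the start of the sentence is found, while "Polski" is still not treated as a stop word. Whether a capitalized token is an abbreviation would still have to be handled separately, e.g. via tokenizer_exceptions.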