spacy-pl/utils

Fix stopwords bug

Closed this issue · 4 comments

Our stopwords, which were merged into spaCy, are not working.
DoD:

  • bug solved on new branch in our spacy repo
  • test for stop words written
  • tests pass (results of the passing tests published in this issue)

How to reproduce the bug:
import spacy

cls = spacy.util.get_lang_class('pl')
nlp = cls()
nlp('w')[0].is_stop  # not working for any stopword

The analogous code for Greek works:
cls = spacy.util.get_lang_class('el')
nlp = cls()
nlp('αδιάκοπα')[0].is_stop

The demo branch runs this code without problems (with the hotfix suggested by @MateuszOlko when the initial lemmatizer version was completed):

import spacy
from spacy.lang.pl import Polish, PolishTagger  # hotfix for getting lemmatizer to work

nlp = Polish()
tagger = PolishTagger(nlp.vocab)  # hotfix for getting lemmatizer to work
nlp.add_pipe(tagger, first=True, name='polish_tagger')

pan_tadeusz = """
Litwo! Ojczyzno moja! ty jesteś jak zdrowie:
Ile cię trzeba cenić, ten tylko się dowie,
Kto cię stracił. Dziś piękność twą w całej ozdobie
Widzę i opisuję, bo tęsknię po tobie.
"""

doc = nlp(pan_tadeusz)
for token in doc[:10]:
    print(f"Token: {token} \n - alpha: {token.is_alpha}, \n - digit: {token.is_digit}, \n - stopword: {token.is_stop}")

Output:

Token: 
 
 - alpha: False, 
 - digit: False, 
 - stopword: False
Token: Litwo 
 - alpha: True, 
 - digit: False, 
 - stopword: False
Token: ! 
 - alpha: False, 
 - digit: False, 
 - stopword: False
Token: Ojczyzno 
 - alpha: True, 
 - digit: False, 
 - stopword: False
Token: moja 
 - alpha: True, 
 - digit: False, 
 - stopword: True
Token: ! 
 - alpha: False, 
 - digit: False, 
 - stopword: False
Token: ty 
 - alpha: True, 
 - digit: False, 
 - stopword: True
Token: jesteś 
 - alpha: True, 
 - digit: False, 
 - stopword: False
Token: jak 
 - alpha: True, 
 - digit: False, 
 - stopword: True
Token: zdrowie 
 - alpha: True, 
 - digit: False, 
 - stopword: False

It seems that the Tagger is needed for the Lemmatizer (or Tokenizer?) to do any meaningful work on the tokens, even though the dummy tagger added on the demo branch always returns the same thing.

If this is the case, I would suggest testing the language after loading one of the newly trained models and checking how well it works. Instead of writing unit tests, we could package the model so that it can finally be loaded via spacy.load, which would make our lives easier. @Gizzio @DoomCoder what do you think?

Added a test on branch bug/stopwords
spacy-pl/spaCy@bfab4b0
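
For readers who cannot open the commit, here is a rough sketch of what a stopword check of this kind might look like. It is not the test from bfab4b0. The helper name missing_stopwords and the stand-in fake_nlp are hypothetical; the only spaCy behaviour assumed is the nlp(text)[0].is_stop access pattern already used in this issue.

```python
from collections import namedtuple


def missing_stopwords(nlp, words):
    """Return the words whose first token is not flagged as a stopword."""
    return [word for word in words if not nlp(word)[0].is_stop]


# Hypothetical stand-in for a spaCy pipeline so the sketch runs without spaCy.
# It flags only exact, lowercase matches against a tiny sample set.
FakeToken = namedtuple("FakeToken", ["text", "is_stop"])


def fake_nlp(text):
    return [FakeToken(text, text in {"w", "z", "ty"})]


RESULT = missing_stopwords(fake_nlp, ["w", "z", "Z", "ty"])
print(RESULT)  # ['Z']
```

With spaCy installed, the same helper could be called with the real pipeline from the reproduction snippet, e.g. missing_stopwords(spacy.util.get_lang_class('pl')(), ['w', 'z']), and a test would assert that the returned list is empty.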

Tested on both our build and a fresh install (spaCy 2.1.1, Python 3.6.6); unable to reproduce the bug.

IMPORTANT NOTE

The only problem I've noticed with stopwords occurs when the first letter is capitalized.
For example, in the sentence "Z ziemi włoskiej do Polski." the first token "Z" is not recognized as a stopword. I thought there was nothing we could do about it, since there seemed to be no way of telling whether it is some sort of abbreviation. It turns out there is a way: tokenizer_exceptions.
In conclusion, for each stopword beginning with a lowercase letter, we should consider adding a copy with the first letter capitalized.
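
A minimal sketch of that last suggestion, in plain Python. The function name with_capitalized_variants and the sample set are made up for illustration; the real Polish stopword list in spaCy is much longer, and how the result would be wired back into the language data is left open here.

```python
def with_capitalized_variants(stop_words):
    """Return a new set holding each stopword plus a first-letter-capitalized copy."""
    expanded = set(stop_words)
    for word in stop_words:
        # Only touch words that start with a lowercase letter; uppercase just the
        # first character so the rest of the word is left unchanged.
        if word and word[0].islower():
            expanded.add(word[0].upper() + word[1:])
    return expanded


# Illustrative sample only: words that appear as stopwords earlier in this issue.
SAMPLE_STOP_WORDS = {"z", "w", "ty", "jak", "moja"}
EXPANDED = with_capitalized_variants(SAMPLE_STOP_WORDS)

print(sorted(EXPANDED))
```

With this sample, "Z" from "Z ziemi włoskiej do Polski." ends up in the expanded set next to "z". The input set is not modified, so the original list stays available for comparison.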