OP + matches only 1 token
marmg opened this issue · 2 comments
I'm using the SpaczzRuler pipeline as specified here to detect companies based on patterns. It is a very simple pipeline in which I'm trying to match uppercase tokens, but the + operator matches only one uppercase token instead of as many as possible. The documentation says:
+ | Require the pattern to match 1 or more times.
However, with the * operator it does match as many tokens as possible, as expected.
How to reproduce the behaviour
import spacy
from spaczz.pipeline import SpaczzRuler
model = spacy.blank('es')
spaczz_ruler = SpaczzRuler(model)
spaczz_ruler.add_patterns([
    {
        "label": "COMPANY",
        "pattern": [
            {"IS_UPPER": True, "OP": "+"},
            {"IS_PUNCT": True, "OP": "?"},
            {"TEXT": {"REGEX": r"S\.\s?[A-Z]\.?\s?[A-Z]?\.?"}},
            {"IS_PUNCT": True, "OP": "?"},
        ],
        "type": "token",
        "id": "COMPANY SL",
    }
])
model.add_pipe(spaczz_ruler)
doc = model("My company is called LARGO AND MARMG S.L.")
print(doc.ents)
# (MARMG S.L.,)
model = spacy.blank('es')
spaczz_ruler = SpaczzRuler(model)
spaczz_ruler.add_patterns([
    {
        "label": "COMPANY",
        "pattern": [
            {"IS_UPPER": True, "OP": "*"},  # "*" instead of "+"
            {"IS_PUNCT": True, "OP": "?"},
            {"TEXT": {"REGEX": r"S\.\s?[A-Z]\.?\s?[A-Z]?\.?"}},
            {"IS_PUNCT": True, "OP": "?"},
        ],
        "type": "token",
        "id": "COMPANY SL",
    }
])
model.add_pipe(spaczz_ruler)
doc = model("My company is called LARGO AND MARMG S.L.")
print(doc.ents)
# (LARGO AND MARMG S.L.,)
Your Environment
Info about spaCy
- Platform: Windows-10-10.0.17134-SP0
- Python version: 3.8.6
- spaCy version: 2.3.5
- spaczz version: 0.5.0
Hi @marmg thank you for bringing this to my attention. I have been very busy with my work and personal life the past few months but I believe I can get this fixed over the next few days.
This stems from the SpaczzRuler inconsistencies outlined here: https://github.com/gandersen101/spaczz#SpaczzRuler-Inconsistencies but is actually caused by spaCy behaving in a way I didn't realize it did.
Due to those SpaczzRuler inconsistencies, the ruler currently runs fuzzy matches first, then regex, then token matches, and expects each set of results to be ordered by ascending start token, then descending length (and then descending match ratio for fuzzy matches). However, spaCy's Matcher (which spaczz's TokenMatcher uses under the hood) does not return results in that order, the way spaCy's PhraseMatcher seems to. So I will need to add that sort operation to the end of spaczz's TokenMatcher to get the expected results.
See the results from spaCy's Matcher and PhraseMatcher respectively:
import spacy
from spacy.matcher import Matcher

model = spacy.blank("es")
matcher = Matcher(model.vocab)
matcher.add(
    "COMPANY",
    [[
        {"IS_UPPER": True, "OP": "+"},
        {"IS_PUNCT": True, "OP": "?"},
        {"TEXT": {"REGEX": r"S\.\s?[A-Z]\.?\s?[A-Z]?\.?"}},
        {"IS_PUNCT": True, "OP": "?"},
    ]],
)
doc = model("My company is called LARGO AND MARMG S.L.")
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end])
# MARMG S.L.
# AND MARMG S.L.
# LARGO AND MARMG S.L.
import spacy
from spacy.matcher import PhraseMatcher

model = spacy.blank("es")
matcher = PhraseMatcher(model.vocab)
matcher.add(
    "COMPANY",
    [model("MARMG S.L."), model("AND MARMG S.L."), model("LARGO AND MARMG S.L.")],
)
doc = model("My company is called LARGO AND MARMG S.L.")
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end])
# LARGO AND MARMG S.L.
# AND MARMG S.L.
# MARMG S.L.
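For illustration, the sort I have in mind (ascending start token, then descending length) could look roughly like the sketch below. This is not the actual spaczz code: `sort_matches` is a hypothetical helper, the match IDs are plain strings for readability (spaCy returns integer hashes), and the token offsets are assumed, not taken from a real run.

```python
def sort_matches(matches):
    """Order (match_id, start, end) tuples by ascending start, then descending length."""
    return sorted(matches, key=lambda m: (m[1], -(m[2] - m[1])))


# Assumed offsets, listed in the order the Matcher printed its results above.
matches = [
    ("COMPANY", 6, 8),  # MARMG S.L.
    ("COMPANY", 5, 8),  # AND MARMG S.L.
    ("COMPANY", 4, 8),  # LARGO AND MARMG S.L.
]

ordered = sort_matches(matches)
print(ordered[0])  # ('COMPANY', 4, 8)
```

With the results in this order, the longest match starting at the earliest token comes first, which is the ordering the ruler expects.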
This actually ended up being a pretty straightforward fix. It will be in v0.5.3 releasing today. Let me know if you continue to run into issues. Thanks!