Plural is not chosen over similar word
brunobg opened this issue · 18 comments
How to reproduce: fuzzy pattern with "Goldriesling", "Riesling", default fuzzy_func. Search on a phrase like They sell many Rieslings
.
Found: Goldriesling
. Expected: Riesling
.
I also am getting a fuzzy result preferred over an exact match if the list contains similar names.
Ok, so here's what I found out. If I'm using SpaczzRuler with a reasonably large (~1000) list of items where some are similar words to others, it often finds something that is not what I expect.
- An exact match does not happen. Example: "Pinot Meunier" or "Pinot Nero" is found for "Pinot Noir". This seems to be affected by compound terms, so that it prefers to find a close term which the previous word makes part of another match. Example: "Riesling" is not found, but "Cape Riesling" is found for a phrase with "Eggo Riesling".
- Variations end up with words that are not as close as possible, such as the case finding "Goldriesling" for "Rieslings" instead of "Riesling".
I understand fuzzy matching can't always handle things perfectly, but this is a bit extreme... So, questions:
- Is there anyway to force exact matches to always "win"? I don't mind adding a spacy ruler if that helps, but it didn't. (how I did it:
nlp.add_pipe(spaczzRuler, before='ner'); nlp.add_pipe(ruler, before='ner')
. Order didn't make a difference). - any way to make "closer" matches, such as plurals, more likely?
- I think this is mostly related to compound terms, any thing that should be done about them? I also have to match "one-two" and "one two".
Thanks a lot for this package, looking forward to understanding how to deal with these issues!
Hi @brunobg. It's exciting to see more interest in spaczz! As I mentioned to you in issue #40, I am close to a new release for spaczz adding a token-based matcher, hopefully addressing the bug in issue #40, and a few other changes. After that is complete I will dig into your questions here. Just wanted you to know that I've seen this and will definitely address your issues in the near future.
Thank @gandersen101! I'm willing to test your new code and help debug any issues, so ping me if you think it's stable enough for that. Really looking forward to the new version.
Hey @brunobg, the code in master is stable so feel free to play around with it. There is a decent bit of refactoring and new features in there but most of the preexisting components behave mostly the same. I'm hoping to have the next release out tomorrow, Jan 19th, towards the end of the day EST, but it could take another couple days.
@gandersen101 I tested it, but I'm still getting very weird results, particularly with exact matches being missed and other matches returned instead, even with single words (Egito
for Argentina
).
My idea of using spaczz is handling typos for words that are not in dictionaries, so a spell checker would be useless. I have a list of around 2k terms in 8 categories. I'm happy to help with testing and bug fixing, but I'm a bit worried with these failures. Do you think spaczz could be using in production at this time?
Hey @brunobg, so I hadn't actually had a lot of time to explore your questions here yet so the code in master shouldn't result in many differences in results from what you're seeing in the latest release.
I will caution you that spaczz is a one person, passion project that is still pre v1.0.0. Any production use should strictly be at your own risk. I use spaczz in a batch process at my job and haven't run into issues yet but obviously bugs/issues will pop up as more people use spaczz and put it through more stress and different use-cases.
I did have a little time to play with some of your example this morning and I am a little confused by some of the issues you are facing because I am either seeing different results or spaczz is working as intended. I am using the code on master for these tests.
Test 1: Blank English Model, No Kwarg/Default Modifications
import spacy
from spaczz.pipeline import SpaczzRuler
nlp = spacy.blank("en")
ruler = SpaczzRuler(nlp)
patterns = [
{"label": "WINE", "pattern": "Goldriesling", "type": "fuzzy"},
{"label": "WINE", "pattern": "Riesling", "type": "fuzzy"},
{"label": "WINE", "pattern": "Eggo Riesling", "type": "fuzzy"},
{"label": "WINE", "pattern": "Pinot Noir", "type": "fuzzy"},
{"label": "COUNTRY", "pattern": "Argentina", "type": "fuzzy"},
{"label": "TEST", "pattern": "one-two", "type": "fuzzy"},
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
doc = nlp(
"""They sell many Rieslings. My favorite wines are Pinot Nero, Pinot Noir,
and Cape Riesling. I am from Egito. And a one two"""
)
print([ent for ent in doc.ents])
[Rieslings, Pinot Nero, Pinot Noir, Cape Riesling, one two]
The fuzzy matcher will find every highest fuzzy ratio fuzzy match for each pattern. Sometimes patterns will overlap when you're looking through many patterns (if you're searching for "Riesling" and "Cape Riesling", "Riesling" is part of both of those patterns) so as long as the min_r2
(final fuzzy ratio check and defaults to 75) is reached in the matcher it will treat all of these as matches (match ratio of 100 is not prioritized over 75) and will then prioritize longer matches over shorter matches. It does this to give the matcher a chance to pick longer matches with errors over always picking a shorter 100% match. So if you were looking for "Riesling" and "Cape Riesling" and the text had "Cap Riesling" in it and the matcher prioritized ratio, it would pick "Riesling" when "Cap Riesling" should be preferred.
So in the above example we see that "Cape Riesling" has a match ratio >= 75 to "Eggo Riesling" so it is matched because it is longer than just "Riesling". We can enforce stricter matching by modifying the min_r2
either for the whole SpaczzRuler or through pattern kwargs. I will do the latter below:
Test 2: Blank English Model with Kwarg Modifications
import spacy
from spaczz.pipeline import SpaczzRuler
nlp = spacy.blank("en")
ruler = SpaczzRuler(nlp)
patterns = [
{"label": "WINE", "pattern": "Goldriesling", "type": "fuzzy"},
{"label": "WINE", "pattern": "Riesling", "type": "fuzzy"},
{
"label": "WINE",
"pattern": "Eggo Riesling",
"type": "fuzzy",
"kwargs": {"min_r2": 90},
},
{"label": "WINE", "pattern": "Pinot Noir", "type": "fuzzy"},
{"label": "COUNTRY", "pattern": "Argentina", "type": "fuzzy"},
{"label": "TEST", "pattern": "one-two", "type": "fuzzy"},
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
doc = nlp(
"""They sell many Rieslings. My favorite wines are Pinot Nero, Pinot Noir,
and Cape Riesling. I am from Egito. And a one two"""
)
print([ent for ent in doc.ents])
[Rieslings, Pinot Nero, Pinot Noir, Riesling, one two]
"Cape Riesling" is no longer matched from "Eggo Riesling" but the "Riesling" part of "Cape Riesling" matches the "Riesling" pattern.
A couple other tips based on your questions. If you want to prioritize exact matches via the spaCy EntityRuler
make sure you add that before the SpaczzRuler
in an nlp pipeline. Earlier in this thread you showed : nlp.add_pipe(spaczzRuler, before='ner'); nlp.add_pipe(ruler, before='ner')
. Instead, put the EntityRuler
first in the pipeline: nlp.add_pipe(spaczzRuler, before="ner"); nlp.add_pipe(ruler, before="spaczzRuler")
Keep in mind the EntityRuler
also prioritizes longer matches over shorter ones but will always pick exact matches.
Lastly be careful with the statistical NER model matches when testing spaczz matches. For example if I use the medium English model instead of a blank English model in the example we've been using:
Test 3: Medium English Model with Kwarg Modifications
import spacy
from spaczz.pipeline import SpaczzRuler
nlp = spacy.load("en_core_web_md")
ruler = SpaczzRuler(nlp)
patterns = [
{"label": "WINE", "pattern": "Goldriesling", "type": "fuzzy"},
{"label": "WINE", "pattern": "Riesling", "type": "fuzzy"},
{
"label": "WINE",
"pattern": "Eggo Riesling",
"type": "fuzzy",
"kwargs": {"min_r2": 90},
},
{"label": "WINE", "pattern": "Pinot Noir", "type": "fuzzy"},
{"label": "COUNTRY", "pattern": "Argentina", "type": "fuzzy"},
{"label": "TEST", "pattern": "one-two", "type": "fuzzy"},
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before="ner")
doc = nlp(
"""They sell many Rieslings. My favorite wines are Pinot Nero, Pinot Noir,
and Cape Riesling. I am from Egito. And a one two"""
)
print([(ent, ent._.spaczz_span) for ent in doc.ents])
[(Rieslings, True), (Pinot Nero, True), (Pinot Noir, True), (Riesling, True), (Egito, False), (one two, True)]
"Egito" is picked up by the model's NER component but is not a match via spaczz.
If you intend to use spaczz as a kind of "spell-checker" I think you'll get better results by turning up the min_r2
argument either for the entire ruler or for certain patterns. You can change the setting for the whole ruler like this: SpaczzRuler(nlp, spaczz_fuzzy_defaults={"min_r2": 90})
. You could also explore using fuzzy regex patterns in spaczz, look at "Approximate 'fuzzy' matching" here.
As a side note min_r1
is mostly a tradeoff between speed and number of evaluations. You may get more matches with a lower min_r1
, but min_r2
is what most strongly controls if a match qualifies or not.
Lastly, the token-based matcher that I will be adding to v0.4.0 will have a syntax similar to spaCy's Matcher
and provides all the same functionality with fuzzy additions (although it runs slower). This may also give you more fine-grained control over your matches.
Wow, thank you for the long and detailed reply. This is incredibly helpful. I need to go back to my code now and modify it with your hints, then I'll reply with my findings.
A couple of remarks:
- I'm using
pt_core_news_sm
. Perhaps this can influence the results. - The text had
Argentina
and the match wasEgito
.
Thanks for the great project. I'll try to devote some time to help you. One-person-projects are the bread and butter of OS.
You're welcome @brunobg !
I tried using the pt_core_new_sm
model on my previous tests, although I am still writing in English (unfortunately, my only language) so the results were a little funky.
import spacy
from spaczz.pipeline import SpaczzRuler
nlp = spacy.load("pt_core_news_sm")
ruler = SpaczzRuler(nlp)
patterns = [
{"label": "WINE", "pattern": "Goldriesling", "type": "fuzzy"},
{"label": "WINE", "pattern": "Riesling", "type": "fuzzy"},
{
"label": "WINE",
"pattern": "Eggo Riesling",
"type": "fuzzy",
"kwargs": {"min_r2": 90},
},
{"label": "WINE", "pattern": "Pinot Noir", "type": "fuzzy"},
{"label": "GPE", "pattern": "Egito", "type": "fuzzy"},
{"label": "TEST", "pattern": "one-two", "type": "fuzzy"},
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before="ner")
doc = nlp(
"""They sell many Rieslings. My favorite wines are Pinot Nero, Pinot Noir,
and Cape Riesling. I am from Argentina. And a one two"""
)
print([(ent, ent._.spaczz_span) for ent in doc.ents])
[(They sell many, False), (Rieslings, True), (Pinot Nero, True), (Pinot Noir, True), (Riesling, True)]
The model ner thinks "They sell many" is an entity and it doesn't capture "one two" but I didn't get a match on "Argentina for "Egito".
The fuzzy matching doesn't match the pattern "one-two" to "one two" in this case because the pt_core_news_sm model does not tokenize "one-two" into three tokens but rather one. By increasing the flex
kwarg to it's max (the len of the pattern, so 1 in this case) we can capture the match. Although this issue stems from a tokenization issue, not a spaczz issue itself.
Decreasing the flex value (min of 0) is another way to control the quality of matches at expense of missing some potential matches (as we just saw).
import spacy
from spaczz.pipeline import SpaczzRuler
nlp = spacy.load("pt_core_news_sm")
ruler = SpaczzRuler(nlp)
patterns = [
{"label": "WINE", "pattern": "Goldriesling", "type": "fuzzy"},
{"label": "WINE", "pattern": "Riesling", "type": "fuzzy"},
{
"label": "WINE",
"pattern": "Eggo Riesling",
"type": "fuzzy",
"kwargs": {"min_r2": 90},
},
{"label": "WINE", "pattern": "Pinot Noir", "type": "fuzzy"},
{"label": "GPE", "pattern": "Egito", "type": "fuzzy"},
{"label": "TEST", "pattern": "one-two", "type": "fuzzy", "kwargs": {"flex": "max"}},
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before="ner")
doc = nlp(
"""They sell many Rieslings. My favorite wines are Pinot Nero, Pinot Noir,
and Cape Riesling. I am from Argentina. And a one two"""
)
print([(ent, ent._.spaczz_span) for ent in doc.ents])
[(They sell many, False), (Rieslings, True), (Pinot Nero, True), (Pinot Noir, True), (Riesling, True), (one two, True)]
Lastly, I wanted to provide you with a little intuition for how the FuzzyMatcher produces matches, which the SpaczzRuler ultimately picks the longest of (along with matches from the Regex and Token matchers and assures no duplicates). So, in your use case, if you prefer prioritizing ratio over length of matches, you could write a callback to incorporate that functionality. As I mentioned before though, you will likely miss out on good, longer matches if part of the phrase matches perfectly. Perhaps some sort of weighting between match ratio and length would give the best results for your use case.
I am going back to using a blank english model and getting rid of kwargs for the sake of illustration.
import spacy
from spaczz.matcher import FuzzyMatcher
nlp = spacy.blank("en")
matcher = FuzzyMatcher(nlp.vocab)
wines = ["Goldriesling", "Riesling", "Eggo Riesling", "Pinot Noir"]
matcher.add("WINE", [nlp(wine) for wine in wines])
matcher.add("GPE", [nlp("Egito")])
matcher.add("TEST", [nlp("one-two")])
doc = nlp(
"""They sell many Rieslings. My favorite wines are Pinot Nero, Pinot Noir,
and Cape Riesling. I am from Argentina. And a one two"""
)
for match_id, start, end, ratio in matcher(doc):
print((doc[start:end], start, end, ratio, match_id))
This produces the following:
(Rieslings, 3, 4, 76, 'WINE') # Match from "Goldriesling"
(Rieslings, 3, 4, 94, 'WINE') # Match from "Riesling" higher ratio would be written to custom attributes in SpaczzRuler
(Pinot Nero, 9, 11, 80, 'WINE') # Match from "Pinot Noir"
(Pinot Noir, 12, 14, 100, 'WINE') # match from "Pinot Noir"
(Cape Riesling, 17, 19, 77, 'WINE') # Longer match from "Eggo Riesling" but we see it barely makes the 75% cutoff.
(Riesling, 18, 19, 100, 'WINE') # match from "Riesling" but the SpaczzRuler will pick the above bc it is longer / makes cutoff.
(Riesling, 18, 19, 80, 'WINE') # Match from "Goldriesling"
(one two, 27, 29, 86, 'TEST') # Match from "one-two"
I'm sure there are opportunities to improve the matching logic and/or provide more options but hopefully that helps for now. I will also note that it is very hard for me to test and code for other languages as I unfortunately only speak English. If you find that Portuguese (or any other language) needs it's own set of considerations, I will need help in accounting for them
Thanks. I think the mismatches I reported are caused by a mix of the pt mode and the size of the pattern list. It's probably not that easy to reproduce in a small, contained test.
I am a bit worried with the performance of spaczz to use it in practice. I could help you to profile and see if there are obvious optimizations, because the ~100x slowdown becomes unfeasible (I was expecting up to ~10x, which would still be fast enough for me; but a 3k character text is taking around 30 seconds with my list of ~1k patterns). I'd need to understand your code to suggest improvements and that might take a while; I don't really have time to develop another OS project, but I can help with tests, suggestions and minor fixes.
But here's some thoughts about that. Could you tell me roughly what is the algorithm in spaczz? My guess is that most of the time comes from comparing neighbor tokens to see if they match. So in theory it seems that if I accept that the tokenization has no errors and all entities are either one word long or won't mix and match with others (so I can never have "Cape Riesling" found for "Riesling") that could be a good optimization at the expense of completely missing a few rarer mistakes (it'd never catch "Gold riesling" if you add "Goldriesling" as a pattern). But it'd work well enough for me if it's fast. It'd be something like this:
for token in doc.tokens:
if token in entitylist: # exact match
entities.append(token, bla bla bla)
else if nearest_match(token, entitylist) >= threshold:
entities.append(token, bla bla bla)
It's a little bit more complicated than that for multi-word entities, but you just match the first term and if it works you check the next one, etc. Now, I don't know what's the complexity and performance of the nearest_match() algorithm you're using (and hopefully it's O(log n), instead of O(n); what is it?), but if I were implementing something like spaczz that'd be my first attempt. This is much simpler from what you're currently doing, of course.
So, is this something you already tried?
Or is this something that can be done with spaczz right now by changing its settings?
Short answer for now as I am trying to wrap up v0.4.0 and the potential changes you're asking for could be patch version changes after that. A combination of ideas you've presented and settings tweaking could definitely optimize things considerably. I'll give a more detailed update later today or tomorrow. Attempting some of these optimizations will be a priority for me after the next release.
Right. Look, I could perhaps help by adding a more "real life example". Load patterns (list of country names could work work well, ~200, easy to find a good one) and a few longer texts. This would be useful to as a harder test to complement the unit ones, since it's more likely to catch a weird combination like #40, and could also be a basic benchmark. Is it something you'd like? If so, let me know about it. I think it could go do a examples
directory instead of tests
?
Hey @brunobg that would be much appreciated and helpful. An examples
directory sounds like a good place to put these as well.
As far as the computational complexity of the search algorithm I'm using, I'm a little embarrassed to say I'm not sure. I familiar with the concepts of O notation but I only started coding a few years ago for a University Data Science program. Most of my engineering skills have been picked up on my own or on the job.
What I do know, now that I've been thinking through the algorithm more, is there are a few glaring inefficiencies I can pretty easily address.
Let's take the pattern, "United Kingdom", for example.
The search, by default, takes two steps. First it iterates through the doc by len(pattern)
- 2 in this case - and compares each two token span back to the pattern. If the match exceeds min_r1
- default 50 - it is passed to further optimization. In optimization the boundaries of the match of moved left +- range(flex)
and right +- range(flex)
. (flex
defaults to max(len(pattern) - 1, 0)
- 1 in this case. It picks the highest ratio match from this optimization that exceeds min_r2
- default is 75.
The algorithm would work fastest with the flex
kwarg set to "min"
or 0
. However, I realized that even in this case I am doing one extra fuzzy comparison that could be refactored away.
In the case of flex > 0, there are three optimizations I could make.
- The extra comparison described above exists in these cases too.
- I will also make a setting that says, during the initial scan through the doc, if a match >= x (probably will default to 100, but will make it user modifiable) we don't care about optimizing that match.
- Instead of flexing the match boundaries in all 4 directions multiple times for patterns with flex > 1, we only do additional boundary moving in the direction that improves the fuzzy ratio.
After these modifications, spaczz still won't be spaCy fast, but I think we'll see a very significant improvement.
Also, the new TokenMatcher
(comparable to spaCy's Matcher
) should be faster as well. In a token pattern it will search for the start of the token match then continue on the pattern. As soon as something doesn't match it moves on. There is no further optimization attempts.
I suggest you read briefly on algorithmic complexity. This is the most important thing when optimizing code. If the algorithm is bad, no good implementation will be able to compete with a slower implementation of a better algorithm.
I think your suggestions are very sound and should make things much faster. spaczz should be always slower than Spacy since a fuzzy comparison will always be slower, but it shouldn't be so slow to be unusable in practice.
I'm a bit swamped here but if you want to keep the discussion going I'll be happy to help. I'll see if I can send that example to help benchmarks.
@brunobg thank you for the examples in #47. I will pull that soon and start experimenting with the examples and using them as benchmarks.
I will also try to spend some time on algorithmic complexity to hopefully further improve spaczz's speed.
I am about to release v0.4.1 which incorporates some of the optimization I discussed above.
- I eliminated the unnecessary extra comparison.
- If
flex > 0
and the newthresh
value is met (default100
) optimization will not be attempted. - Optimization when
flex > 0
could be further improved but for now it will terminate if a round of optimization does not produce a better ratio. I still want to have it only pursue the best single flexing direction after the initial round but that will require more code changes.
See the performance section in the readme for parameter changes that will further increase speed.
Hi @brunobg I know it's been a little while but I wanted to provide some updates. I reformatted some of your examples you provided but the core of them remains the same and they helped me identify a few issues in the SpaczzRuler. I believe I took care of most of the quirks in the SpaczzRuler's matching process in v0.4.2. There are additional improvements to be made but I think I can break these into multiple smaller issues.
- Performance issues and improvements are being tracked in issue #20.
- I don't have an issue open for ease-of-use updates but this is on my to-do list. Doc updates can be covered in #7.
- I think you would like a way to sort SpaczzRuler matches by quality over length or some sort of weighting but this will depend on the completion of #53 and #54. See the "SpaczzRuler Inconsistencies" section of "Known Issues" in the readme.
Let me know if there is anything I am missing. If not I would like to close this and work on the above in more focused issues.
Thanks for your help and interest!
PS I'll provide some more information on your issue #45 separately soon.
Hey! It all looks great. The documentation on the example is very good and clear. Thank you and congratulations!