Compare strings stripping accents/case sensitive
davidggphy opened this issue · 4 comments
First of all, thanks for the library @gandersen101. I'm starting to use it and it's really powerful.
Using SpaczzRuler with fuzzy patterns, by default it compares strings in a case-insensitive way. Is there a way of changing this behaviour?
Similarly, is there a way of comparing strings without taking accents into account? That is, making "test" equivalent to "tést". It could be hacked by swapping the string for an accent-stripped version of it (since that maintains the token structure), but maybe there is an easier way.
import sys
import spacy
import spaczz
from spaczz.pipeline import SpaczzRuler
print(f"{sys.version = }")
print(f"{spacy.__version__ = }")
print(f"{spaczz.__version__ = }")
nlp = spacy.blank("en")
fuzzy_ruler = SpaczzRuler(nlp, name="test_ruler")
fuzzy_ruler.add_patterns([{"label": "TEST",
                           "pattern": "test",
                           "type": "fuzzy"}])
doc = fuzzy_ruler(nlp("this is a test, also THIS IS A TEST, and a tast, we have a TesT, tést, tëst"))
print(f"\nText:\n{doc}\n")
print("Fuzzy Matches:")
for ent in doc.ents:
    if ent._.spaczz_type == "fuzzy":
        print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_ratio))
Output
sys.version = '3.9.0 (default, Nov 15 2020, 06:25:35) \n[Clang 10.0.0 ]'
spacy.__version__ = '3.0.6'
spaczz.__version__ = '0.5.2'

Text:
this is a test, also THIS IS A TEST, and a tast, we have a TesT, tést, tëst

Fuzzy Matches:
('test', 3, 4, 'TEST', 100)
('TEST', 9, 10, 'TEST', 100)
('tast', 13, 14, 'TEST', 75)
('TesT', 18, 19, 'TEST', 100)
('tést', 20, 21, 'TEST', 75)
('tëst', 22, 23, 'TEST', 75)
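The accent-stripping hack mentioned above could look roughly like the sketch below. It uses only Python's standard `unicodedata` module; the helper name `strip_accents` is made up for illustration and is not part of spaCy or spaczz.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining diacritical marks removed."""
    # NFD splits precomposed characters such as "é" into a base letter
    # plus a combining mark; dropping the marks leaves the base letter.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left so the result is in a standard form.
    return unicodedata.normalize("NFC", stripped)


print(strip_accents("tést, tëst, TesT"))  # prints "test, test, TesT"
```

Note that this changes the text itself, so any entity found afterwards would carry the stripped spelling rather than the original one.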
Hi @davidggphy, thanks for the kind words!
Making the fuzzy matching in spaczz case-sensitive is pretty straightforward. If you're using the SpaczzRuler you can control this at either the ruler level or the pattern level, as shown below:
Pattern-Level
import spacy
from spaczz.pipeline import SpaczzRuler
nlp = spacy.blank("en")
text = "testing, TESTING"
doc = nlp(text)
patterns = [
    {
        "label": "TEST",
        "pattern": "testing",
        "type": "fuzzy",
        # Boolean False, not the string "False" (a non-empty string is truthy).
        "kwargs": {"ignore_case": False},
    },
    {
        "label": "TEST",
        "pattern": "TESTING",
        "type": "fuzzy",
        "kwargs": {"ignore_case": False},
    },
]
ruler = SpaczzRuler(nlp)
ruler.add_patterns(patterns)
doc = ruler(doc)
for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_ratio))
('testing', 0, 1, 'TEST', 100)
('TESTING', 2, 3, 'TEST', 100)
Ruler-Level
import spacy
from spaczz.pipeline import SpaczzRuler
nlp = spacy.blank("en")
text = "testing, TESTING"
doc = nlp(text)
patterns = [
    {
        "label": "TEST",
        "pattern": "testing",
        "type": "fuzzy",
    },
    {
        "label": "TEST",
        "pattern": "TESTING",
        "type": "fuzzy",
    },
]
ruler = SpaczzRuler(nlp, fuzzy_defaults={"ignore_case": False})
ruler.add_patterns(patterns)
doc = ruler(doc)
for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_ratio))
('testing', 0, 1, 'TEST', 100)
('TESTING', 2, 3, 'TEST', 100)
For handling accents I would recommend two approaches. One is to preprocess your text with a library like Textacy to strip out accents before running it through spaCy/spaczz. This changes the text itself, so it probably provides the most flexibility but adds another step. The other option is to use one of RapidFuzz's fuzzy matchers that preprocess text before fuzzy matching but won't actually change the text itself. You can control this in spaczz at the pattern and/or ruler level just like the examples above.
A word of warning though: the default RapidFuzz preprocessor "remov[es] all non alphanumeric characters - trim[s] whitespaces - convert[s] all characters to lower case" according to its docs. RapidFuzz supports customizing the preprocessing with a custom callable; however, spaczz does not currently support passing a custom callable to RapidFuzz. I can add this, but it'll probably be later next week before I can get to that.
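To make the quoted description concrete, here is a rough pure-Python approximation of those three steps. It is not RapidFuzz's code: `approx_default_process` is a hypothetical helper, and replacing non-alphanumeric characters with a space (rather than deleting them outright) is an assumption made so that words stay separated.

```python
def approx_default_process(text: str) -> str:
    """Approximate the three preprocessing steps quoted above."""
    # Hypothetical stand-in, not RapidFuzz's implementation.
    # Step 1: blank out anything that is not alphanumeric.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text)
    # Steps 2 and 3: trim surrounding whitespace, then lowercase.
    return cleaned.strip().lower()


for raw in ["  TesT!", "tést", "tëst"]:
    print(repr(approx_default_process(raw)))
# prints 'test', then 'tést', then 'tëst'
```

Under this reading, accented letters count as alphanumeric and pass through unchanged, so preprocessing of this kind handles case and punctuation but does not by itself make "tést" equal to "test".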
The following RapidFuzz matchers do preprocessing:
- "quick" (essentially the same as the default matcher but does preprocessing)
- "token_set"
- "token_sort"
- "partial_token_set"
- "partial_token_sort"
- "token"
- "partial_token"
- "weighted"
Here's an example of changing the fuzzy matcher on the ruler level:
import spacy
from spaczz.pipeline import SpaczzRuler
nlp = spacy.blank("en")
text = "testing, TESTING"
doc = nlp(text)
patterns = [
    {
        "label": "TEST",
        "pattern": "testing",
        "type": "fuzzy",
    },
    {
        "label": "TEST",
        "pattern": "TESTING",
        "type": "fuzzy",
    },
]
ruler = SpaczzRuler(nlp, fuzzy_defaults={"fuzzy_func": "quick"})
ruler.add_patterns(patterns)
doc = ruler(doc)
for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_, ent._.spaczz_ratio))
('testing', 0, 1, 'TEST', 100)
('TESTING', 2, 3, 'TEST', 100)
Hopefully that helps!
Hi @davidggphy did the above adequately answer your question? If you still need/want a feature implemented please let me know and I can track that in this issue, otherwise I will close this issue in the next couple days. Thanks!
Dear @gandersen101 ,
Sorry for my late reply. I tested what you said. Sadly, as you said, RapidFuzz performs preprocessing on the strings, but this does not include "deaccenting". It would be really interesting to add support for a custom preprocessing callable when computing the fuzzy scores.
As you said, the other possibility is to preprocess the text before sending it to the matcher, but then the entities found will be preprocessed accordingly, which is something I would like to prevent. I could hack around it by later finding the same tokens in the original text, but that would be more cumbersome.
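A minimal sketch of that workaround, assuming accent stripping leaves the token boundaries unchanged: compare against a normalized copy of each token, but report the token from the original text at the same index. Everything here is illustrative. `strip_accents` and `find_in_original` are made-up names, a whitespace split stands in for the spaCy tokenizer, and an exact comparison stands in for the fuzzy matcher.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_in_original(text: str, pattern: str) -> list[tuple[int, str]]:
    """Match on accent-stripped tokens, report the original tokens."""
    tokens = text.split()  # whitespace split stands in for a real tokenizer
    hits = []
    for index, token in enumerate(tokens):
        # Compare a normalized copy; the original token is never modified.
        if strip_accents(token).casefold() == pattern.casefold():
            hits.append((index, token))
    return hits


print(find_in_original("a test and a tést or a TËST", "test"))
# prints [(1, 'test'), (4, 'tést'), (7, 'TËST')]
```

With spaCy the same idea would presumably mean running the matcher on a doc built from the stripped text and then slicing the original doc with each match's start and end, which only works if both docs tokenize identically.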
Hey @davidggphy, thanks for the additional info. I am planning on doing a spaczz feature upgrade/overhaul in the near future and will keep the ability to add custom preprocessing without modifying the doc in mind. Thanks!