Adding POS tagging while building pattern for Spaczzruler
Ibrokhimsadikov opened this issue · 13 comments
Hello, I am really liking Spaczz, to fuzzy match entity patterns.
Quick question: is there a way to add constraints such as POS tags as well? For example, I want to extract only noun-phrase occurrences of AS, but the fuzzy match also returns 'as' from "as above function".
Below, 'i' is each string from the list of vocab to match:
{'label': "ECHO", 'pattern': [{'TEXT': i, 'POS': 'NOUN'}], 'type': 'fuzzy'}
Hi @Ibrokhimsadikov, thanks for the kind words. I have gotten behind on spaczz maintenance and improvements lately but am hoping to get back on track in the near future.
I believe implementing some form of POS constraints should be doable, but I'm going to have to think about how I actually want to go about it.
I will keep you updated here as that progresses.
Hi @Ibrokhimsadikov, sorry there has not been much visible development on this issue yet. However, I did want to update you on where I am at with thinking/working through this.
The ideal way to add this feature would be to add fuzzy matching support directly to spaCy's matcher. However, because much of it is written in Cython, that is beyond my current coding capabilities.
Accordingly, my original thought was to write a Python implementation very similar to spaCy's matcher. However, this quickly proved to be a massive undertaking that was mostly redundant.
Therefore, I think I am going to attempt this by writing an abstraction that translates these "fuzzy" patterns into spaCy matcher-compatible patterns. It would find the fuzzy matches, then rewrite the patterns with the verbatim text found. For example:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The manager gave me acess to the database so now I can acces it.")
pattern = [{"TEXT": {"FUZZY": "access"}, "POS": "NOUN"}]
# AbstractedMatcher.add(pattern)
# Under the hood, this would find fuzzy matches of "access" in the text and then use them to rewrite
# the pattern into one verbatim pattern per match, each compatible with spaCy's matcher:
[[{"TEXT": "acess", "POS": "NOUN"}], [{"TEXT": "acces", "POS": "NOUN"}]]
# This would then return only the first misspelling of "access", "acess", as it is the noun form.
This will still take some time to develop but I feel better about this direction.
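The rewrite step described above can be sketched in pure Python. This is only an illustration under stated assumptions, not spaczz's implementation: `expand_fuzzy_pattern` and `fuzzy_ratio` are hypothetical helpers, the standard library's `difflib` stands in for whatever fuzzy scorer spaczz uses, the threshold of 80 is arbitrary, and a whitespace split stands in for spaCy tokenization. It handles single-token patterns only.

```python
from difflib import SequenceMatcher


def fuzzy_ratio(a, b):
    """Return a 0-100 similarity score (difflib stands in for a real fuzzy scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def expand_fuzzy_pattern(pattern, tokens, min_ratio=80):
    """Rewrite a single-token {"TEXT": {"FUZZY": ...}} pattern into verbatim patterns.

    Hypothetical helper: returns one single-token pattern per distinct token
    text that fuzzy-matches the target, copying other constraints unchanged.
    """
    spec = pattern[0]
    target = spec["TEXT"]["FUZZY"]
    expanded = []
    seen = set()
    for text in tokens:
        if text not in seen and fuzzy_ratio(text, target) >= min_ratio:
            seen.add(text)
            # Replace only the TEXT constraint; POS and the rest carry over.
            expanded.append([{**spec, "TEXT": text}])
    return expanded


# Whitespace-split tokens stand in for a spaCy Doc here.
tokens = "The manager gave me acess to the database so now I can acces it .".split()
pattern = [{"TEXT": {"FUZZY": "access"}, "POS": "NOUN"}]
patterns = expand_fuzzy_pattern(pattern, tokens)
print(patterns)
# [[{'TEXT': 'acess', 'POS': 'NOUN'}], [{'TEXT': 'acces', 'POS': 'NOUN'}]]
```

The resulting list has the shape spaCy's `Matcher.add` expects (a list of patterns, each a list of token dicts), so the POS filtering itself would be left to spaCy's matcher.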
In the meantime, I will post a more obtuse, but still useful, workaround that makes use of on-match callbacks with the FuzzyMatcher. I should get to that this evening.
Dear @gandersen101,
First of all, thank you so much for not forgetting about me. I am very grateful for your effort, as this is the only library that integrates a fuzzy approach. With spaczz I was able to get more entities than with spaCy's matcher alone. As you know, one of the biggest issues in NER is building the dictionary/knowledge base, which usually has to cover different variations of a string, or synonyms; this is a very time-consuming manual effort for custom NER. Spaczz does well, albeit at the expense of memory consumption when running inside a spaCy pipeline.
Also, is AbstractedMatcher a custom pipeline component of yours, similar to SpaczzRuler?
Thank you so much. I check in on this repo from time to time to see your updates. Looking forward to your "obtuse" :) solution so I can start testing it, as I am working with spaczz right now.
Hi @Ibrokhimsadikov thanks for the kind words. I'm very happy that you and others are finding this project useful. I certainly haven't forgotten about this request, I've just had less time than I would like to work on spaczz lately.
The AbstractedMatcher in the example above is just a placeholder, which I will probably end up naming SpaczzMatcher, and I will also incorporate its functionality into the SpaczzRuler.
Below is a workaround with the FuzzyMatcher you can use for now. It will only work as expected with single-token patterns and the flex argument set to 0. This is definitely a limited solution but you may be able to expand the idea. The eventual SpaczzMatcher will be much more flexible than this.
import spacy
from spacy.tokens import Span
from spaczz.matcher import FuzzyMatcher
nlp = spacy.load("en_core_web_md")
text = "The manager gave me acess to the database so now I can acces it."
doc = nlp(text)

def add_ent(matcher, doc, i, matches):
    """Callback on match function. Adds entities to doc with name of label."""
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end, _ratio = matches[i]
    entity = Span(doc, start, end, label=match_id)
    # If Span already has entity assigned will skip rather than raising exception.
    try:
        doc.ents += (entity,)
    except AttributeError:
        pass

def select_nouns(matcher, doc, i, matches):
    """Callback on match function. Will continue passing matches that are nouns."""
    # This will only work with single-token patterns.
    # Also calling the above callback within this function to add entities to the doc.
    match_id, start, _end, _ratio = matches[i]
    if doc[start].pos_ == "NOUN":
        add_ent(matcher, doc, i, matches)

matcher = FuzzyMatcher(nlp.vocab, flex=0)
# Flex = 0 with single-token patterns will approximate token matching for now.
matcher.add("TEST", [nlp("access")], on_match=select_nouns)
matches = matcher(doc)
# Only the noun version of "access" was added to the doc.
for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_))
('acess', 4, 5, 'TEST')
Hope that helps for now!
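One way the callback idea above might be expanded to multi-token matches is to filter the returned match list afterwards instead of inside a callback. The sketch below is an assumption-laden illustration, not part of spaczz: `filter_matches_by_pos` is a hypothetical helper, the POS tags are hand-written stand-ins for `token.pos_` values (a real tagger may assign different tags), and the ratio values are placeholders. The match tuples follow the (label, start, end, ratio) shape unpacked in the callbacks above.

```python
def filter_matches_by_pos(pos_tags, matches, allowed):
    """Keep only matches whose tokens all carry an allowed POS tag.

    pos_tags: one tag per token, standing in for [t.pos_ for t in doc].
    matches: (label, start, end, ratio) tuples with token-index spans.
    """
    return [
        match
        for match in matches
        if all(tag in allowed for tag in pos_tags[match[1]:match[2]])
    ]


# Hand-written tags for:
# The manager gave me acess to the database so now I can acces it .
pos_tags = ["DET", "NOUN", "VERB", "PRON", "NOUN", "ADP", "DET", "NOUN",
            "ADV", "ADV", "PRON", "AUX", "VERB", "PRON", "PUNCT"]

# Placeholder matches for "acess" (token 4) and "acces" (token 12).
matches = [("TEST", 4, 5, 91), ("TEST", 12, 13, 91)]

kept = filter_matches_by_pos(pos_tags, matches, {"NOUN", "PROPN"})
print(kept)
# [('TEST', 4, 5, 91)]
```

Requiring every token in the span to carry an allowed tag is a deliberately strict choice; checking only the last token of the span would be a looser alternative for noun phrases.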
Thank you so much for your response. I will start using it. Immense thanks!
People interested in using the Cython source may find this question of interest:
https://stackoverflow.com/questions/65454160/incorporating-fuzzy-search-in-a-matcher-object
Hi @ronyarmon thank you for keeping us updated with your research.
I hope to eventually Cythonize the algorithmic components of spaczz and integrate them with spaCy Vocab objects but that is currently beyond my programming capabilities. It will be a fairly long-term process for me to develop my C/Cython skills enough to accomplish that so if you and/or others are able to accomplish that faster/better than I can you'll certainly have my full support! If the spaCy team decides to implement some of this functionality even better!
Ultimately, I made spaczz to provide features I didn't see anywhere else in the current spaCy ecosystem but I know for sure they could be implemented better than they are now.
In the meantime, I hope to have a new version of spaczz with this requested feature ready in the next couple weeks and will continue to provide updates here.
So as of now I have the implementation of this feature broken up into 5 distinct elements that I will be working on mostly sequentially.
- Create the algorithm that will search through a Doc using token patterns.
- Create a mapping from the output of the algorithm to spaCy Matcher compatible patterns.
- Wrap the algorithm and mapping into the SpaczzMatcher.
- The SpaczzMatcher won't be able to return match ratios itself, so I will move all match ratio information to custom token/span attributes and properties to keep all the matchers consistent while retaining all desired information.
- Integrate the new SpaczzMatcher into the SpaczzRuler.
Pull #35 completes the first task in this list. Hoping to have more done soon!
More progress on this feature. Please see the roadmap below:
- Create the algorithm that will search through a Doc using token patterns.
- Create a mapping from the output of the algorithm to spaCy Matcher compatible patterns.
- Wrap the algorithm and mapping into the SpaczzMatcher.
- The SpaczzMatcher won't be able to return match ratios itself, so I will move all match ratio information to custom token/span attributes and properties to keep all the matchers consistent while retaining all desired information.
- Integrate the new SpaczzMatcher into the SpaczzRuler.
I am hoping to have this feature finished this week.
@ronyarmon your Stack Overflow question received an interesting response that I will explore in the near future. Since I am close to implementing this feature in my pure-Python way, I will finish that before exploring how to expand the spaCy Matcher.
Thank you for sharing that @gandersen101
A few days overdue but this is closed by spaczz v0.4.0. Hopefully you all enjoy it. Please raise an issue if you run into any bugs!
Thank you so much @gandersen101, I will definitely try that. Just FYI, I know speed is a known issue, but I want to share my observations: processing 2 million reports averaging 150 words each took approximately 20 days, versus 3 days with spaCy's EntityRuler, in production on an AWS ml.m5.12xlarge notebook instance. For POS, spaczz is amazing. Thank you once again, I will implement the POS tagging capability as well.
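For scale, the figures reported above can be converted into rough throughput numbers. This is back-of-the-envelope arithmetic on the stated approximations (2 million reports, about 150 words each, about 20 versus 3 days), not a benchmark; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
REPORTS = 2_000_000
WORDS_PER_REPORT = 150


def throughput(days):
    """Return (reports per second, words per second) for a run of the given length."""
    seconds = days * SECONDS_PER_DAY
    return REPORTS / seconds, REPORTS * WORDS_PER_REPORT / seconds


spaczz_reports, spaczz_words = throughput(20)
ruler_reports, ruler_words = throughput(3)
slowdown = 20 / 3

print(f"spaczz:      {spaczz_reports:.2f} reports/s, {spaczz_words:.0f} words/s")
print(f"EntityRuler: {ruler_reports:.2f} reports/s, {ruler_words:.0f} words/s")
print(f"slowdown:    {slowdown:.1f}x")
# spaczz:      1.16 reports/s, 174 words/s
# EntityRuler: 7.72 reports/s, 1157 words/s
# slowdown:    6.7x
```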
Hey @Ibrokhimsadikov. Thank you for the speed profiling. Definitely a lot of room for improvement. Issue #41 turns into a performance discussion and I am planning on doing some (hopefully substantial) enhancements very soon. I will also try to keep track of major performance updates in issue #20 over the long-term.
Let me know if you have questions on the token matcher. There is an example in the readme and more in spaczz document tests and test suite.