Possible infinite loop
brunobg opened this issue · 3 comments
When I run my tests with spaczz@master, they seem to get stuck in an infinite loop at the nlp()
call. Stack dumps:
File "/usr/lib64/python3.8/site-packages/spacy/language.py", line 445, in __call__
doc = proc(doc, **component_cfg.get(name, {}))
File "/usr/lib64/python3.8/site-packages/spaczz/pipeline/spaczzruler.py", line 150, in __call__
for fuzzy_match in self.fuzzy_matcher(doc):
File "/usr/lib64/python3.8/site-packages/spaczz/matcher/_phrasematcher.py", line 103, in __call__
matches_wo_label = self._searcher.match(doc, pattern, **kwargs)
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 133, in match
matches_w_nones = [
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 134, in <listcomp>
self._optimize(
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 217, in _optimize
r = self.compare(query, doc[bp_l:bp_r], *args, **kwargs)
File "doc.pyx", line 308, in spacy.tokens.doc.Doc.__getitem__
File "/usr/lib64/python3.8/site-packages/spacy/util.py", line 491, in normalize_slice
if not (step is None or step == 1):
Another Ctrl-C during another run:
self._doc = nlp(text)
File "/usr/lib64/python3.8/site-packages/spacy/language.py", line 445, in __call__
doc = proc(doc, **component_cfg.get(name, {}))
File "/usr/lib64/python3.8/site-packages/spaczz/pipeline/spaczzruler.py", line 150, in __call__
for fuzzy_match in self.fuzzy_matcher(doc):
File "/usr/lib64/python3.8/site-packages/spaczz/matcher/_phrasematcher.py", line 103, in __call__
matches_wo_label = self._searcher.match(doc, pattern, **kwargs)
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 133, in match
matches_w_nones = [
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 134, in <listcomp>
self._optimize(
File "/usr/lib64/python3.8/site-packages/spaczz/search/_phrasesearcher.py", line 205, in _optimize
rl = self.compare(query, doc[p_l : p_r - f], *args, **kwargs)
File "/usr/lib64/python3.8/site-packages/spaczz/search/fuzzysearcher.py", line 109, in compare
return round(self._fuzzy_funcs.get(fuzzy_func)(a_text, b_text))
Hi @brunobg, this is concerning but hard to diagnose with the information at hand. If there is any way you could pinpoint which pattern/doc combinations are causing this, that would be extremely helpful. Spaczz has good test coverage and I have used it on the job on medical texts, but new issues will always come up as people apply spaczz in new settings.
One thing to keep in mind is that spaczz can be extremely slow given a large enough pattern list and document(s). In issue #20 I explain why this is, and why significantly speeding up spaczz in the short term is beyond my capabilities. I'm not saying that is what is happening here, but keep it in mind as well.
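One rough way to narrow down which input is responsible is to time each text separately and sort by duration. This is only a sketch: `time_each` and `fake_nlp` are hypothetical names made up for illustration, and `fake_nlp` stands in for the real `nlp` pipeline object, which you would pass in instead.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_each(
    process: Callable[[str], object], texts: Iterable[str]
) -> List[Tuple[float, str]]:
    """Run `process` on each text and return (seconds, text) pairs, slowest first."""
    timings = []
    for text in texts:
        start = time.perf_counter()
        process(text)
        timings.append((time.perf_counter() - start, text))
    return sorted(timings, reverse=True)


# Hypothetical stand-in for `nlp`; replace it with the real pipeline object.
def fake_nlp(text: str) -> str:
    if "slow" in text:
        time.sleep(0.05)  # simulate one pathological input
    return text


if __name__ == "__main__":
    samples = ["fast one", "the slow one", "fast two"]
    for seconds, text in time_each(fake_nlp, samples):
        print(f"{seconds:.4f}s  {text!r}")
```

The same idea should work for patterns: build the pipeline with one pattern (or one batch of patterns) at a time and compare the timings.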
This happens only in one specific test, so I can probably isolate the pattern like I did before. It has been "fast enough" on every other test, which is why I think it's an infinite loop. Other tests take milliseconds, this one is still going after 10 seconds. Speed is not an issue for me within reasonable times.
I read #20 and it makes sense to me (though running it through a profiler would help to pinpoint exactly where it takes too long).
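For what it's worth, a minimal profiling sketch using the standard library's `cProfile` could look like the following. `profile_call` and `slow_compare` are hypothetical names used only for illustration; in practice the profiled call would be something like `nlp(text)`.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Profile one call to `func` and return the `top` entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


# Hypothetical stand-in for the slow call; swap in the real one, e.g. nlp(text).
def slow_compare(n):
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    print(profile_call(slow_compare, 100_000))
```

Sorting by cumulative time puts the outermost expensive functions first, which should show whether the time goes into the search loop itself or into the string comparison it calls.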
Closing this. You're right, it just takes long (~100 times longer than spaCy NER).