nipunsadvilkar/pySBD

Infinite loop?

The code below seems to hang forever:

import pysbd

segmenter = pysbd.Segmenter(language="en", clean=False)
text = "..[111 111 111 111 111 111 111 111 111 111]"
segmenter.segment(text)  # never returns

Interrupting it, I get this traceback:

Traceback (most recent call last):
  File "check.py", line 5, in <module>
    segmenter.segment(text)
  File ".../python3.7/site-packages/pysbd/segmenter.py", line 87, in segment
    postprocessed_sents = self.processor(text).process()
  File ".../python3.7/site-packages/pysbd/processor.py", line 37, in process
    self.replace_periods_before_numeric_references()
  File ".../python3.7/site-packages/pysbd/processor.py", line 141, in replace_periods_before_numeric_references
    r"∯\2\r\7", self.text)
  File ".../python3.7/re.py", line 192, in sub
    return _compile(pattern, flags).sub(repl, string, count)
KeyboardInterrupt

This is pysbd version 0.3.3 on Python 3.7.7.

Could it be entering an infinite loop?

(I found this bug by applying pysbd to Wikipedia. On this article: https://en.wikipedia.org/wiki/Clojure it tripped up on "...[484 216 622 139 651 592 379 228 242 355]".)

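It's not an infinite loop. It's due to catastrophic backtracking in NUMBERED_REFERENCE_REGEX, which the traceback shows being applied by re.sub in replace_periods_before_numeric_references. I need to dig into the details.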
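
For anyone unfamiliar with the term, here is a minimal sketch of what catastrophic backtracking looks like in Python's re module. The pattern below is a toy example with a nested quantifier, chosen only for illustration; it is an assumption and is NOT the actual NUMBERED_REFERENCE_REGEX from pysbd. The names TOY_PATTERN and time_failed_match are hypothetical.

```python
import re
import time

# Toy pattern with a nested quantifier. This is NOT the pysbd regex;
# it only illustrates the general failure mode.
TOY_PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Return the seconds spent failing to match 'a' * n followed by 'b'."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = TOY_PATTERN.search(text)
    elapsed = time.perf_counter() - start
    # The trailing 'b' makes a match impossible, so the engine has to
    # try every way of splitting the run of 'a's before giving up.
    assert result is None
    return elapsed


# Each extra character roughly doubles the work, so keep n small.
for n in (10, 14, 18, 20):
    print(n, f"{time_failed_match(n):.4f}s")
```

The call always terminates eventually, but the running time grows exponentially with the input length, which is why a fairly short string can look like a hang.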

Hi @nipunsadvilkar, we faced the same issue with another text:
text = ......[289852000000260698,289852000000260744

Any update on this, please?
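
Until there is a fix, one possible stopgap is to run the segmentation in a child process with a timeout, so a pathological input cannot stall the whole job. This is only a sketch using the standard library; run_with_timeout is a hypothetical helper and not part of pysbd, and the timeout value is an arbitrary assumption.

```python
import subprocess
import sys


def run_with_timeout(snippet, timeout_s):
    """Run a Python snippet in a child process.

    Returns "ok" if it exits cleanly, "error" if it exits with a
    non-zero status, and "timeout" if it exceeds timeout_s seconds.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", snippet],
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return "timeout"
    return "ok" if completed.returncode == 0 else "error"
```

Passing a snippet that imports pysbd and calls segmenter.segment(text) on one of the inputs reported above should then come back as "timeout" instead of blocking, assuming the behaviour described in this issue. Starting a process per call is slow, so this only makes sense for batch jobs where skipping a bad document is acceptable.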