nipunsadvilkar/pySBD

Bug in German splitting with parenthesis

kongyurui opened this issue · 0 comments

Describe the bug
When an opening parenthesis appears in certain positions in German text, sentence splitting crashes with an re.error.

To Reproduce

from pysbd import Segmenter

text = 'auf der Suche nach Einsätzen als Skilehrer im DACH-Raum. Langjährige Erfahrung im Leiten von Gruppen diverser Altersgruppen und Sportarten. B.A. Sport und Gesundheit in Prävention und Therapie (Deutsche Spothochschule Köln) Zertifikate: Erste Hilfe, DRK Rettungsschwimmer silber, DSHS Fitnesstrainer B(asic) Lizenz, Aquafitness Instructor, Progressive Muskelentspannung'

de_split = Segmenter(language='de')

de_split.segment(text)

This crashes at

File "/home/erik/.local/lib/python3.8/site-packages/pysbd/lang/deutsch.py", line 74, in scan_for_replacements
txt = re.sub(r'(?<={am})\.(?=\s)'.format(am=am), '∯', txt)

Expected behavior
Segments text

Additional context
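The crash is triggered by the sequence B(a in "B(asic)": the matched text is interpolated into the regex unescaped, so its opening parenthesis unbalances the pattern.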
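The failure can be reproduced without pysbd. The sketch below only uses the standard library and assumes that am holds the matched text B(a when the pattern from deutsch.py is built; the variable names pattern and error_message are illustrative and not part of pysbd.

```python
import re

# Assumption: this is the matched text that pysbd passes in as `am`.
am = "B(a"

# Same format string as in the failing line of deutsch.py.
pattern = r'(?<={am})\.(?=\s)'.format(am=am)

try:
    re.compile(pattern)
    error_message = None
except re.error as exc:
    # The unescaped "(" leaves the lookbehind without its closing ")".
    error_message = str(exc)

print(pattern)
print(error_message)
```

The printed pattern shows three opening parentheses but only two closing ones, which is why the regex compiler rejects it.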

Suggested fix: Add

        am = re.escape(am)

to scan_for_replacements in deutsch.py, before the re.sub call.
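A minimal standalone sketch of the suggested fix might look like the following. The function name scan_for_replacements_escaped and its two-argument signature are hypothetical, chosen so the example runs on its own; the real method in deutsch.py belongs to a class and may take further arguments.

```python
import re


def scan_for_replacements_escaped(txt, am):
    """Hypothetical standalone version of the patched method."""
    # Escape regex metacharacters in the matched text so that
    # characters such as "(" are treated literally.
    am = re.escape(am)
    # Replace a period that directly follows the match and precedes
    # whitespace with the placeholder character used in deutsch.py.
    return re.sub(r'(?<={am})\.(?=\s)'.format(am=am), '∯', txt)


# The sequence from this issue no longer raises re.error.
result = scan_for_replacements_escaped("Fitnesstrainer B(asic) Lizenz", "B(a")
print(result)
```

With the escape in place the pattern compiles, and text without a matching period is returned unchanged.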

Traceback (most recent call last):
  File "german_fix.py", line 8, in <module>
    de_split.segment(text)
  File "/home/erik/.local/lib/python3.8/site-packages/pysbd/segmenter.py", line 87, in segment
    postprocessed_sents = self.processor(text).process()
  File "/home/erik/.local/lib/python3.8/site-packages/pysbd/processor.py", line 34, in process
    self.replace_abbreviations()
  File "/home/erik/.local/lib/python3.8/site-packages/pysbd/processor.py", line 180, in replace_abbreviations
    self.text = self.abbreviations_replacer().replace()
  File "/home/erik/.local/lib/python3.8/site-packages/pysbd/lang/deutsch.py", line 66, in replace
    self.text = self.search_for_abbreviations_in_string(self.text)
  File "/home/erik/.local/lib/python3.8/site-packages/pysbd/abbreviation_replacer.py", line 92, in search_for_abbreviations_in_string
    text = self.scan_for_replacements(
  File "/home/erik/.local/lib/python3.8/site-packages/pysbd/lang/deutsch.py", line 74, in scan_for_replacements
    txt = re.sub(r'(?<={am})\.(?=\s)'.format(am=am), '∯', txt)
  File "/usr/lib/python3.8/re.py", line 210, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python3.8/re.py", line 304, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 759, in _parse
    raise source.error("missing ), unterminated subpattern",
re.error: missing ), unterminated subpattern at position 0