nipunsadvilkar/pySBD

Exception when clean=True in search_for_connected_sentences

balazik opened this issue · 1 comments

Describe the bug
Segmenter will raise "exception: bad escape (end of pattern) at position" when it is initialized with clean=True and it encounters a sentence like "etc.Png,Jpg,.\" (word/token that contains a backslash).

The exception is raised in:
module:
cleaner.py
class:
class Cleaner
method name:
search_for_connected_sentences
line:

txt = re.sub(re.escape(word), new_word, txt)

To Reproduce
Steps to reproduce the behavior:

# This is a simplified example, the original text contained names so I changed it to img formats
# Word that is a abbreviation with dot followed by upper case letter and backslash
sentencer = pysbd.Segmenter(language="en", clean=True)
txt = "etc.Png,Jpg,.\\"
sentences = sentencer.segment(txt)

Expected behavior
The output should be the same as is, but is should not trow an exception.
Workaround to see the output is to escape the backslash.

sentencer = pysbd.Segmenter(language="en", clean=True)
txt = "etc.Png,Jpg,.\\\\"
sentences = sentencer.segment(txt)

Expected output:

['etc.', 'Png,Jpg,.', '\\']

Possible solution
replace txt = re.sub(re.escape(word), new_word, txt)
with txt = txt.replace(word, new_word)
It avoids all the pitfalls of regular expressions (like escaping), and is generally faster.

Additional context
Originally we parse small text files (in Slovak language) without special treatment to form a huge sentenced corpus. The example was specially crafted just to reproduce the behavior for English parser. I know that the backslash combination is rare for English but it happens to occur in Slovak articles when you process vast amounts of text.

Additional Case:

Also ran into this in spanish text with the string 1.C\ ... assume it is the same problem:

re.error: bad escape (end of pattern) at position 4