Exception when clean=True in search_for_connected_sentences
balazik opened this issue · 1 comments
Describe the bug
Segmenter raises "exception: bad escape (end of pattern) at position" when it is initialized with clean=True and encounters text like "etc.Png,Jpg,.\" (a word/token that ends with a backslash).
The exception is raised in:
module: cleaner.py
class: Cleaner
method name: search_for_connected_sentences
line: txt = re.sub(re.escape(word), new_word, txt)
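The failing call can be isolated from pysbd. The sketch below assumes that the cleaner rule keeps the trailing backslash in new_word; the concrete replacement string is hypothetical and only illustrates the mechanism. re.escape protects the pattern argument, but re.sub still parses its second argument as a replacement template, where a backslash starts an escape sequence.

```python
import re

text = "etc.Png,Jpg,.\\"            # ends with ONE backslash character
word = text                         # treat the whole token as the word to replace
new_word = "etc. Png,Jpg,.\\"       # hypothetical cleaned form, still ends with a backslash

pattern = re.escape(word)
# The escaped pattern itself is valid and matches the text.
pattern_matches = re.fullmatch(pattern, text) is not None

try:
    re.sub(pattern, new_word, text)  # new_word is parsed as a replacement template
    message = None
except re.error as exc:
    message = str(exc)

print(pattern_matches)
print(message)
```

The pattern compiles and matches, so the error comes from the replacement string, not from the escaped word.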
To Reproduce
Steps to reproduce the behavior:
import pysbd

# This is a simplified example; the original text contained names, so I changed them to image formats.
# The word is an abbreviation with a dot, followed by an uppercase letter, and ends with a backslash.
sentencer = pysbd.Segmenter(language="en", clean=True)
txt = "etc.Png,Jpg,.\\"
sentences = sentencer.segment(txt)
Expected behavior
The output should be the same as the input, and no exception should be thrown.
A workaround to see the output is to escape the backslash:
sentencer = pysbd.Segmenter(language="en", clean=True)
txt = "etc.Png,Jpg,.\\\\"
sentences = sentencer.segment(txt)
Expected output:
['etc.', 'Png,Jpg,.', '\\']
Possible solution
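Replace txt = re.sub(re.escape(word), new_word, txt)
with txt = txt.replace(word, new_word)
str.replace treats both arguments literally, so it avoids the escaping pitfalls of regular expressions (re.escape only protects the pattern; new_word is still parsed as a replacement template), and it is generally faster.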
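A minimal sketch of the proposed change, outside pysbd. The helper names and the value of new_word are hypothetical; only the two call shapes are taken from this issue. For comparison it also shows a variant that keeps re.sub but passes a callable as the replacement, which the re module uses verbatim instead of parsing it as a template.

```python
import re

def replace_with_re_sub(txt, word, new_word):
    # Current call shape: new_word is parsed as a replacement template.
    return re.sub(re.escape(word), new_word, txt)

def replace_literal(txt, word, new_word):
    # Proposed fix: plain string replacement, no regex involved.
    return txt.replace(word, new_word)

def replace_with_callable(txt, word, new_word):
    # Alternative: the return value of a callable is inserted as-is.
    return re.sub(re.escape(word), lambda _match: new_word, txt)

txt = "etc.Png,Jpg,.\\"
word = "Jpg,.\\"
new_word = "Jpg, .\\"   # hypothetical rule output, ends with a backslash

try:
    replace_with_re_sub(txt, word, new_word)
    original_failed = False
except re.error:
    original_failed = True

fixed_literal = replace_literal(txt, word, new_word)
fixed_callable = replace_with_callable(txt, word, new_word)
print(original_failed, fixed_literal == fixed_callable)
```

Both alternatives replace every occurrence of word, as the original call does, so behavior should only differ when new_word contains a backslash.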
Additional context
Originally we parse small text files (in Slovak language) without special treatment to form a huge sentenced corpus. The example was specially crafted just to reproduce the behavior for English parser. I know that the backslash combination is rare for English but it happens to occur in Slovak articles when you process vast amounts of text.
Additional Case:
Also ran into this in Spanish text with the string 1.C\
... assume it is the same problem:
re.error: bad escape (end of pattern) at position 4
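The reported position is consistent with the error coming from the replacement string. As a sketch, assume the cleaner rule inserts a space after the period, turning 1.C\ into "1. C\" (this value is an assumption, not taken from the pysbd source). The trailing backslash then sits at index 4 of the replacement.

```python
import re

word = "1.C\\"        # 4 characters, the last one is a backslash
new_word = "1. C\\"   # assumed rule output: 5 characters, backslash at index 4

try:
    re.sub(re.escape(word), new_word, word)
    pos = None
except re.error as exc:
    pos = exc.pos     # index within the string that failed to parse

print(pos, new_word.index("\\"))
```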