nipunsadvilkar/pySBD

Shouldn't colons cause a sentence split?

RuABraun opened this issue · 2 comments

They currently don't:

>>> import pysbd
>>> seg = pysbd.Segmenter(language="en", clean=False)
>>> s = 'Tomorrow I will do the greatest thing ever: Become a god.'
>>> seg.segment(s)
['Tomorrow I will do the greatest thing ever: Become a god.']
>>> s = 'The best player of the city: Zob Ahan F.C. and Sepahan F.C..'
>>> seg.segment(s)
['The best player of the city: Zob Ahan F.C. and Sepahan F.C..']

@RuABraun: Not by default; this is a design choice of pysbd and pragmatic_segmenter. If you still want to split on colons, add : to the character class in the last alternative of SENTENCE_BOUNDARY_REGEX, so that it reads \S.*?[。．.:！!?？ȸȹ☉☈☇☄]. The current definition is:

SENTENCE_BOUNDARY_REGEX = r"（(?:[^）])*）(?=\s?[A-Z])|「(?:[^」])*」(?=\s[A-Z])|\((?:[^\)]){2,}\)(?=\s[A-Z])|\'(?:[^\'])*[^,]\'(?=\s[A-Z])|\"(?:[^\"])*[^,]\"(?=\s[A-Z])|\“(?:[^\”])*[^,]\”(?=\s[A-Z])|[。．.！!?？].*|\S.*?[。．.！!?？ȸȹ☉☈☇☄]"
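
To see what the change does, here is a rough sketch using only Python's re module. It applies a simplified stand-in for the last alternative of the regex (ASCII terminators only) in isolation, so it does not reproduce the rest of pysbd's pipeline, such as the abbreviation handling that keeps "F.C." together. The names DEFAULT_TAIL and COLON_TAIL are made up for this illustration and are not part of pysbd.

```python
import re

# Simplified stand-in for the last alternative of SENTENCE_BOUNDARY_REGEX.
# Assumption: only ASCII terminators are kept, to keep the sketch short.
DEFAULT_TAIL = r"\S.*?[.!?]"

# The same alternative with ':' added to the character class.
COLON_TAIL = r"\S.*?[.:!?]"

text = "Tomorrow I will do the greatest thing ever: Become a god."

# Each match starts at a non-space character and lazily extends to the
# first terminator it finds.
default_parts = re.findall(DEFAULT_TAIL, text)
colon_parts = re.findall(COLON_TAIL, text)

print(default_parts)
print(colon_parts)
```

With the colon in the class, the lazy match stops at "ever:" and a second match starts at "Become"; without it, the whole string is a single match. Note that this would also split on colons in times, URLs and ratios, which may be part of why it is not the default.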

Thanks