destructive behaviour in edge-cases
aflueckiger opened this issue · 5 comments
As of v0.3.3, pySBD shows destructive behavior in some edge cases even when setting the option clean to False.
When dealing with OCR text, pySBD removes whitespace after multiple periods.
To reproduce
import pysbd
splitter = pysbd.Segmenter(language="fr", clean=False)
text = "Maissen se chargea du reste .. Logiquement,"
print(splitter.segment(text))
text = "Maissen se chargea du reste ... Logiquement,"
print(splitter.segment(text))
text = "Maissen se chargea du reste .... Logiquement,"
print(splitter.segment(text))
Actual output
Please note the missing whitespace after the final period in the first and third examples (.. and ....).
['Maissen se chargea du reste .', '.', 'Logiquement,']
['Maissen se chargea du reste ... ', 'Logiquement,']
['Maissen se chargea du reste .', '...', 'Logiquement,']
Expected output
['Maissen se chargea du reste .', '. ', 'Logiquement,']
['Maissen se chargea du reste ... ', 'Logiquement,']
['Maissen se chargea du reste .', '... ', 'Logiquement,']
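With clean=False, a useful invariant is that concatenating the returned segments reproduces the input exactly. The sketch below (a hypothetical helper using the outputs reported above, not calling pySBD itself) makes the destructiveness visible:

```python
def is_lossless(original: str, segments: list) -> bool:
    # With clean=False, joining the segments should rebuild the input verbatim.
    return "".join(segments) == original

text = "Maissen se chargea du reste .. Logiquement,"
# Output reported above for pySBD v0.3.3: the space after '..' is dropped.
actual = ['Maissen se chargea du reste .', '.', 'Logiquement,']
expected = ['Maissen se chargea du reste .', '. ', 'Logiquement,']

print(is_lossless(text, actual))    # False: a whitespace character was lost
print(is_lossless(text, expected))  # True: every character is preserved
```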
In general, pySBD works well. Many thanks @nipunsadvilkar. I can also look into this as soon as I find some time and open a pull request.
I noticed another destructive behaviour in the context of noisy OCR:
Actual
import pysbd
splitter = pysbd.Segmenter(language="de", clean=False)
splitter.segment("Der Uebel) abe. .hatt' nun geglaubt")
# ['Der Uebel) abe. ', '. ', "hatt' nun geglaubt"]
Note the hallucinated space after the 2nd period.
Expected
['Der Uebel) abe. ', '.', "hatt' nun geglaubt"]
I can look into this in January if this is still open by then.
@aflueckiger Thanks for reporting the bug! Will look into it
Yes, you are more than welcome to send PR if you happen to fix it earlier. Thanks again :)
@nipunsadvilkar I have tried running pySBD on all the sentences in the Golden Rules.
There are a few cases in which it does not segment properly.
Hi. I got the destructive behavior too.
To reproduce
import pysbd
splitter = pysbd.Segmenter(language="fr", clean=False, char_span=True)
example_1 = "Phrase 1. Phrase 1."
example_2 = "changement des mentalités. converger vers l'adhésion ! !"
sent_spans_1 = splitter.segment(example_1)
sent_spans_2 = splitter.segment(example_2)
Actual
# sent_spans_1
[TextSpan(sent='Phrase 1. ', start=0, end=10),
TextSpan(sent='Phrase 1. ', start=0, end=10)]
# sent_spans_2
[TextSpan(sent='changement des mentalités. ', start=0, end=28),
TextSpan(sent=' ', start=26, end=28),
TextSpan(sent="converger vers l'adhésion ! ", start=28, end=56),
TextSpan(sent=' ! ', start=53, end=56)]
Expected
# sent_spans_1
[TextSpan(sent='Phrase 1. ', start=0, end=10),
TextSpan(sent='Phrase 1.', start=10, end=19)]
# sent_spans_2
[TextSpan(sent='changement des mentalités.', start=0, end=26),
TextSpan(sent=' ', start=26, end=28),
TextSpan(sent="converger vers l'adhésion !", start=28, end=55),
TextSpan(sent=' !', start=55, end=57)]
I may have found a solution. The use of finditer in Segmenter.sentences_with_char_spans is incorrect. I propose:
def sentences_with_char_spans(self, sentences: List[str]) -> List[TextSpan]:
    spans: List[TextSpan] = list()
    start = end = 0
    for i_sentence in sentences:
        new_start = self.original_text.find(i_sentence, start)
        if new_start != end:
            spans[-1].end = new_start
            spans[-1].sent = self.original_text[spans[-1].start:new_start]
        end = new_start + len(i_sentence)
        spans.append(TextSpan(i_sentence, new_start, end))
        start = end
    return spans
This code above avoids having non-contiguous spans.
Besides the fact that the sentence segmentation itself is still not "good", the expected behavior of the method above is restored.
I may be wrong, but while debugging the code I found that what is done in sentences_with_char_spans is already done in Processor.sentence_boundary_punctuation; the last statement seems to have the same effect. Instead of returning:
[m.group() for m in re.finditer(self.lang.SENTENCE_BOUNDARY_REGEX, txt)]
you could return:
[(m.group(), *m.span()) for m in re.finditer(self.lang.SENTENCE_BOUNDARY_REGEX, txt)]
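For illustration, here is what that change yields with re.finditer and a simplified boundary pattern (an assumption for the sketch, not pySBD's actual SENTENCE_BOUNDARY_REGEX): m.span() already carries the exact offsets, so no second search over the text is needed.

```python
import re

# Simplified stand-in for self.lang.SENTENCE_BOUNDARY_REGEX
# (assumption, not the real pySBD pattern).
SENTENCE_BOUNDARY_REGEX = r".*?[.!?](?:\s+|$)"
txt = "Phrase un. Phrase deux."

# Each tuple is (sentence, start offset, end offset).
with_spans = [(m.group(), *m.span())
              for m in re.finditer(SENTENCE_BOUNDARY_REGEX, txt)]
print(with_spans)
# [('Phrase un. ', 0, 11), ('Phrase deux.', 11, 23)]
```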
I forgot to test one case. Since we process elements from left to right, I had tested the beginning and the middle of the text but not the end.
The new code:
def sentences_with_char_spans(self, sentences: List[str]) -> List[TextSpan]:
    spans: List[TextSpan] = list()
    start = end = 0
    for i_sentence in sentences:
        new_start = self.original_text.find(i_sentence, start)
        if new_start != end:
            spans[-1].end = new_start
            spans[-1].sent = self.original_text[spans[-1].start:new_start]
        end = new_start + len(i_sentence)
        spans.append(TextSpan(i_sentence, new_start, end))
        start = end
    # the next lines are added
    final = self.original_text[spans[-1].end:]
    if final:
        spans[-1].end += len(final)
        spans[-1].sent += final
    return spans
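The end-of-text handling can be exercised in isolation with a minimal harness. The MiniSegmenter and TextSpan below are stripped-down stand-ins for the real pySBD classes, and the sentences list is an assumption about what the upstream regex segmentation would yield for this input:

```python
class TextSpan:
    # Minimal stand-in for pysbd.utils.TextSpan.
    def __init__(self, sent, start, end):
        self.sent, self.start, self.end = sent, start, end

class MiniSegmenter:
    # Holds only what sentences_with_char_spans needs: the original text.
    def __init__(self, original_text):
        self.original_text = original_text

    def sentences_with_char_spans(self, sentences):
        spans = []
        start = end = 0
        for i_sentence in sentences:
            new_start = self.original_text.find(i_sentence, start)
            if new_start != end:  # gap: stretch the previous span over it
                spans[-1].end = new_start
                spans[-1].sent = self.original_text[spans[-1].start:new_start]
            end = new_start + len(i_sentence)
            spans.append(TextSpan(i_sentence, new_start, end))
            start = end
        # attach any trailing text to the last span
        final = self.original_text[spans[-1].end:]
        if final:
            spans[-1].end += len(final)
            spans[-1].sent += final
        return spans

seg = MiniSegmenter("Phrase un. Phrase deux. ")  # note the final whitespace
spans = seg.sentences_with_char_spans(['Phrase un.', 'Phrase deux.'])
print([(s.sent, s.start, s.end) for s in spans])
# [('Phrase un. ', 0, 11), ('Phrase deux. ', 11, 24)]
```

The trailing space ends up attached to the last span, so the spans still cover the whole input.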
The new case is:
import pysbd
splitter = pysbd.Segmenter(language="fr", clean=False, char_span=True)
example_3 = "Phrase un. Phrase deux. " # with a final whitespace
sent_spans_3 = splitter.segment(example_3)
Actual
# sent_spans_3
[TextSpan(sent='Phrase un. ', start=0, end=11),
TextSpan(sent='Phrase deux.', start=11, end=23)]
Expected
# sent_spans_3
[TextSpan(sent='Phrase un. ', start=0, end=11),
TextSpan(sent='Phrase deux. ', start=11, end=24)]