nipunsadvilkar/pySBD

destructive behaviour in edge-cases

aflueckiger opened this issue · 5 comments

As of v0.3.3, pySBD shows destructive behavior in some edge cases, even when the option clean is set to False.
When dealing with OCR text, pySBD removes the whitespace after multiple periods.

To reproduce

import pysbd

splitter = pysbd.Segmenter(language="fr", clean=False)

text = "Maissen se chargea du reste .. Logiquement,"
print(splitter.segment(text))

text = "Maissen se chargea du reste ... Logiquement,"
print(splitter.segment(text))

text = "Maissen se chargea du reste .... Logiquement,"
print(splitter.segment(text))

Actual output
Please note the missing whitespace after the final period in the examples with .. and ....

['Maissen se chargea du reste .', '.', 'Logiquement,']
['Maissen se chargea du reste ... ', 'Logiquement,']
['Maissen se chargea du reste .', '...', 'Logiquement,']

Expected output

['Maissen se chargea du reste .', '. ', 'Logiquement,']
['Maissen se chargea du reste ... ', 'Logiquement,']
['Maissen se chargea du reste .', '... ', 'Logiquement,']
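The difference between actual and expected output boils down to one invariant: with clean=False, joining the segments should reproduce the input character for character. A minimal check, independent of pysbd (the segment lists below are copied from the outputs above):

```python
def is_lossless(original, segments):
    # with clean=False, segmentation should be non-destructive:
    # joining the segments must reproduce the input exactly
    return "".join(segments) == original

text = "Maissen se chargea du reste .. Logiquement,"

# actual v0.3.3 output: the space after the second period is gone
print(is_lossless(text, ['Maissen se chargea du reste .', '.', 'Logiquement,']))   # False

# expected output: every character of the input is preserved
print(is_lossless(text, ['Maissen se chargea du reste .', '. ', 'Logiquement,']))  # True
```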

In general, pySBD works well. Many thanks @nipunsadvilkar. I can also look into this as soon as I find some time and open a pull request.

I noticed another destructive behaviour in the context of noisy OCR:

Actual

import pysbd

splitter = pysbd.Segmenter(language="de", clean=False)

splitter.segment("Der Uebel) abe. .hatt' nun geglaubt")
# ['Der Uebel) abe. ', '. ', "hatt' nun geglaubt"]

Note the hallucinated space after the second period.

Expected
['Der Uebel) abe. ', '.', "hatt' nun geglaubt"]
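Joining the segments makes the hallucinated character easy to spot; this is the mirror image of the first bug (a character inserted rather than dropped). A quick check, independent of pysbd:

```python
text = "Der Uebel) abe. .hatt' nun geglaubt"
actual = ['Der Uebel) abe. ', '. ', "hatt' nun geglaubt"]

joined = "".join(actual)
print(joined)                   # "Der Uebel) abe. . hatt' nun geglaubt"
print(len(joined) - len(text))  # 1 -- one extra character was inserted
```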

I can look into this in January if this is still open by then.

@aflueckiger Thanks for reporting the bug! Will look into it

Yes, you are more than welcome to send PR if you happen to fix it earlier. Thanks again :)

@nipunsadvilkar I have tried running pySBD on all the sentences in the Golden Rules.

There are a few cases in which it does not segment properly.

Hi. I got the destructive behavior too.

To reproduce

import pysbd

splitter = pysbd.Segmenter(language="fr", clean=False, char_span=True)

example_1 = "Phrase 1. Phrase 1."
example_2 = "changement des mentalités.  converger vers l'adhésion ! !"
sent_spans_1 = splitter.segment(example_1)
sent_spans_2 = splitter.segment(example_2)

Actual

# sent_spans_1
[TextSpan(sent='Phrase 1. ', start=0, end=10),
 TextSpan(sent='Phrase 1. ', start=0, end=10)]
# sent_spans_2
[TextSpan(sent='changement des mentalités.  ', start=0, end=28),
 TextSpan(sent='  ', start=26, end=28),
 TextSpan(sent="converger vers l'adhésion ! ", start=28, end=56),
 TextSpan(sent=' ! ', start=53, end=56)]

Expected

# sent_spans_1
[TextSpan(sent='Phrase 1. ', start=0, end=10),
 TextSpan(sent='Phrase 1.', start=10, end=19)]
# sent_spans_2
[TextSpan(sent='changement des mentalités.', start=0, end=26),
 TextSpan(sent='  ', start=26, end=28),
 TextSpan(sent="converger vers l'adhésion !", start=28, end=55),
 TextSpan(sent=' !', start=55, end=57)]
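The expected invariant can be written down as a small check: consecutive spans should be contiguous, and each span's indices should match its text. The TextSpan class below is a minimal stand-in for pysbd's own class, not the library's actual implementation; the span lists are copied from the outputs above:

```python
from dataclasses import dataclass

@dataclass
class TextSpan:
    # minimal stand-in for pysbd's TextSpan
    sent: str
    start: int
    end: int

def spans_are_contiguous(spans):
    """Spans should tile the text: each span starts where the previous
    one ended, and each span's length matches its (start, end) indices."""
    return all(
        prev.end == cur.start for prev, cur in zip(spans, spans[1:])
    ) and all(s.end - s.start == len(s.sent) for s in spans)

# the actual output repeats the first span, so the indices overlap
actual = [TextSpan('Phrase 1. ', 0, 10), TextSpan('Phrase 1. ', 0, 10)]
expected = [TextSpan('Phrase 1. ', 0, 10), TextSpan('Phrase 1.', 10, 19)]
print(spans_are_contiguous(actual))    # False
print(spans_are_contiguous(expected))  # True
```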

I may have found a solution. The usage of finditer in Segmenter.sentences_with_char_spans is wrong. I propose:

def sentences_with_char_spans(self, sentences: List[str]) -> List[TextSpan]:
    spans: List[TextSpan] = list()
    start = end = 0
    for i_sentence in sentences:
        new_start = self.original_text.find(i_sentence, start)
        if new_start != end:
            spans[-1].end = new_start
            spans[-1].sent = self.original_text[spans[-1].start:new_start]
        end = new_start + len(i_sentence)
        spans.append(TextSpan(i_sentence, new_start, end))
        start = end
    return spans

The code above avoids producing non-contiguous spans.
Aside from the fact that the sentence segmentation itself is still not ideal, the expected behavior of the method is restored.
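The proposed method can be exercised in isolation by lifting it into a free function that takes the original text as a parameter (TextSpan is again a minimal stand-in for pysbd's class, and the sentence list is assumed to be what the segmenter produces for the first example):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TextSpan:
    # minimal stand-in for pysbd's TextSpan
    sent: str
    start: int
    end: int

def sentences_with_char_spans(original_text: str, sentences: List[str]) -> List[TextSpan]:
    # free-function version of the proposed method, for testing in isolation
    spans: List[TextSpan] = []
    start = end = 0
    for i_sentence in sentences:
        # search from `start` so a repeated sentence cannot match an
        # earlier occurrence, as happens with plain finditer/find
        new_start = original_text.find(i_sentence, start)
        if new_start != end:
            # stretch the previous span to cover any gap
            spans[-1].end = new_start
            spans[-1].sent = original_text[spans[-1].start:new_start]
        end = new_start + len(i_sentence)
        spans.append(TextSpan(i_sentence, new_start, end))
        start = end
    return spans

for s in sentences_with_char_spans("Phrase 1. Phrase 1.", ["Phrase 1. ", "Phrase 1."]):
    print(s)
# TextSpan(sent='Phrase 1. ', start=0, end=10)
# TextSpan(sent='Phrase 1.', start=10, end=19)
```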

I may be wrong, but while debugging the code I found that what sentences_with_char_spans does is already handled in Processor.sentence_boundary_punctuation: its last statement seems to have the same effect. Instead of returning:

[m.group() for m in re.finditer(self.lang.SENTENCE_BOUNDARY_REGEX, txt)]

you could return:

[(m.group(), *m.span()) for m in re.finditer(self.lang.SENTENCE_BOUNDARY_REGEX, txt)]
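For illustration, each match object from re.finditer already knows its own position in the source string, so emitting (text, start, end) tuples is a one-line change. The regex below is a simplified stand-in for the language's SENTENCE_BOUNDARY_REGEX, not pysbd's actual pattern:

```python
import re

# simplified stand-in for lang.SENTENCE_BOUNDARY_REGEX: a run of
# non-terminators, then optional terminators, then trailing whitespace
SENTENCE_BOUNDARY_REGEX = r"[^.!?]+[.!?]*\s*"

txt = "Phrase 1. Phrase 1."
print([(m.group(), *m.span()) for m in re.finditer(SENTENCE_BOUNDARY_REGEX, txt)])
# [('Phrase 1. ', 0, 10), ('Phrase 1.', 10, 19)]
```

Because the spans come straight from the matcher, the repeated-sentence ambiguity of a later find() never arises.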

I forgot to test one case. Since the elements are processed from left to right, I had thought to test the beginning and the middle of the text, but not the end.

The new code:

def sentences_with_char_spans(self, sentences: List[str]) -> List[TextSpan]:
    spans: List[TextSpan] = list()
    start = end = 0
    for i_sentence in sentences:
        new_start = self.original_text.find(i_sentence, start)
        if new_start != end:
            spans[-1].end = new_start
            spans[-1].sent = self.original_text[spans[-1].start:new_start]
        end = new_start + len(i_sentence)
        spans.append(TextSpan(i_sentence, new_start, end))
        start = end
    # the next lines are added
    final = self.original_text[spans[-1].end:]
    if final:
        spans[-1].end += len(final)
        spans[-1].sent += final
    return spans

The new case is:

import pysbd

splitter = pysbd.Segmenter(language="fr", clean=False, char_span=True)

example_3 = "Phrase un. Phrase deux. "  # with a final whitespace
sent_spans_3 = splitter.segment(example_3)

Actual

# sent_spans_3
[TextSpan(sent='Phrase un. ', start=0, end=11),
 TextSpan(sent='Phrase deux.', start=11, end=23)]

Expected

# sent_spans_3
[TextSpan(sent='Phrase un. ', start=0, end=11),
 TextSpan(sent='Phrase deux. ', start=11, end=24)]
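The tail handling can also be checked in isolation with the same free-function sketch as before (TextSpan is a minimal stand-in for pysbd's class, and the sentence list is assumed input):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TextSpan:
    # minimal stand-in for pysbd's TextSpan
    sent: str
    start: int
    end: int

def sentences_with_char_spans(original_text: str, sentences: List[str]) -> List[TextSpan]:
    spans: List[TextSpan] = []
    start = end = 0
    for i_sentence in sentences:
        new_start = original_text.find(i_sentence, start)
        if new_start != end:
            spans[-1].end = new_start
            spans[-1].sent = original_text[spans[-1].start:new_start]
        end = new_start + len(i_sentence)
        spans.append(TextSpan(i_sentence, new_start, end))
        start = end
    # attach any leftover tail (e.g. a final whitespace) to the last span
    final = original_text[spans[-1].end:]
    if final:
        spans[-1].end += len(final)
        spans[-1].sent += final
    return spans

text = "Phrase un. Phrase deux. "  # with a final whitespace
for s in sentences_with_char_spans(text, ["Phrase un. ", "Phrase deux."]):
    print(s)
# TextSpan(sent='Phrase un. ', start=0, end=11)
# TextSpan(sent='Phrase deux. ', start=11, end=24)
```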