common-voice/cv-sentence-extractor

Sentence-level lookbehind rule

somerandomguyontheweb opened this issue · 3 comments

I'm working on sentence extraction from Belarusian Wikipedia, and I see quite a few examples like this:

Таксама ў царкве захоўваюцца часціца мошчаў св. Серафіма Сароўскага і святыя мошчы сарака угоднікаў Божых.

'The church also stores a particle of the relics of St. Seraphim of Sarov and the sacred relics of forty saints.'

Here св. 'St.' is an abbreviation, expanded as святога (but in other contexts it can also be expanded as святы, святая and so on). I've disabled this abbreviation in the rules file be.toml, but now I get this sentence extracted:

Серафіма Сароўскага і святыя мошчы сарака угоднікаў Божых.

'Seraphim of Sarov and the sacred relics of forty saints.'

Although it begins with a capital letter, it isn't a well-formed sentence, and we probably don't want to extract it.

Is it somehow possible to specify a "lookbehind" rule, so that a sentence would be dropped when the previous sentence (in the original text) ends with a specific pattern, such as св.?

Thanks for filing this issue. Here's how the extraction works in general:

  • Read in all the text for all the articles from a chunk file -> we have all sentences per article
  • For each article's text, we split the full text into sentences using punkt
  • We apply the replacements defined in the replacements rules
  • We choose 3 sentences at random from each article where the rules tell us this is a valid sentence
  • We repeat that for all the chunk files generated by WikiExtractor
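The steps above can be sketched roughly as follows. This is only an illustration in Python, not the extractor's actual code: `split_sentences` is a naive regex stand-in for punkt, and `apply_replacements`, `extract`, and the `is_valid` callback are hypothetical names. The point is that splitting happens first and the rules only ever see one already-split segment at a time.

```python
import random
import re


def split_sentences(text):
    # Naive stand-in for punkt (assumption): split on whitespace that
    # follows '.', '!' or '?'.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def apply_replacements(sentence, replacements):
    # Apply each (old, new) pair in order.
    for old, new in replacements:
        sentence = sentence.replace(old, new)
    return sentence


def extract(articles, replacements, is_valid, max_per_article=3, seed=0):
    # is_valid is a hypothetical callback standing in for the rules file;
    # it receives a single segment and no surrounding context.
    rng = random.Random(seed)
    picked = []
    for text in articles:
        sentences = [apply_replacements(s, replacements)
                     for s in split_sentences(text)]
        valid = [s for s in sentences if is_valid(s)]
        picked.extend(rng.sample(valid, min(max_per_article, len(valid))))
    return picked
```

Because `is_valid` only receives the segment itself, a rule has no way to know what preceded that segment in the article.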

So splitting into sentences and applying the rules file are independent of each other. For this scenario I see two options:

Option 1

Those specific sentence parts might occur multiple times: once as the full sentence returned by the sentence splitting (punkt), and once as only the second part after "St.", because rust-punkt is not very good at splitting sentences. That would be #11.

Option 2

The sentence only appears once, but in the second run rust-punkt split the sentence differently than the first time (which would make the whole thing even worse). This would be #11 too.

I honestly don't see how the rules would affect how we split the text into sentences, as we only apply the rules once everything has been split. So here's what I think happened:

  • Initially the sentence got split into two sentences, as if the period for св. was a sentence terminator
  • Without the rule both sentences were valid
  • After applying the new rule, only the second sentence part was valid
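To make that hypothesis concrete, here is a small Python sketch using the sentence from the report. Both `naive_split` (splitting after every period) and `is_valid` (rejecting segments that contain a disallowed abbreviation token) are assumptions made for illustration; neither is how rust-punkt or the rules file is actually implemented.

```python
import re

TEXT = ("Таксама ў царкве захоўваюцца часціца мошчаў св. Серафіма Сароўскага "
        "і святыя мошчы сарака угоднікаў Божых.")


def naive_split(text):
    # Treat every period followed by whitespace as a sentence boundary,
    # including the one that belongs to the abbreviation "св.".
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]


def is_valid(segment, disallowed_abbreviations):
    # Hypothetical rule: reject a segment containing a disallowed token.
    return not any(tok in disallowed_abbreviations for tok in segment.split())


parts = naive_split(TEXT)
valid_without_rule = [p for p in parts if is_valid(p, set())]
valid_with_rule = [p for p in parts if is_valid(p, {"св."})]
```

With no abbreviation rule, both halves pass. Once "св." is disallowed, the first half is rejected and only the fragment starting with "Серафіма" remains, which matches what was reported.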

Did you see anything that would contradict this statement? Otherwise I'd close this as a duplicate of #11.

Thank you @MichaelKohler for the detailed explanation – I think your summary is accurate. The issue that I've reported appears to be a particular case of #11: even if we set needs_uppercase_start = true and define a reasonable list of abbreviations that cannot be sentence-final, we will still get segments that follow an abbreviation and start with an uppercase letter (so that the original poster's suggestion wouldn't work). My point is that adding a new rule type for sentence-level lookbehind would allow us to solve exactly this issue, i.e., decide that the second part is actually not valid by looking at the text that precedes it. That's just an idea – I understand that it probably wouldn't be trivial to implement, given that sentence splitting and rules application are independent.
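For illustration, a lookbehind rule of this kind could look roughly like the Python sketch below. Everything here is hypothetical: `filter_with_lookbehind`, the `lookbehind_tokens` parameter, and the `is_valid` callback are not existing options of the extractor. It also assumes the rules step receives the segments of an article in their original order, which is exactly the part that is not available today.

```python
def filter_with_lookbehind(segments, is_valid, lookbehind_tokens):
    # Keep a segment only if it passes the per-segment rules AND the
    # segment before it does not end with one of the lookbehind tokens.
    kept = []
    previous_last_token = None
    for segment in segments:
        follows_abbreviation = previous_last_token in lookbehind_tokens
        if is_valid(segment) and not follows_abbreviation:
            kept.append(segment)
        tokens = segment.split()
        previous_last_token = tokens[-1] if tokens else None
    return kept


segments = [
    "Таксама ў царкве захоўваюцца часціца мошчаў св.",
    "Серафіма Сароўскага і святыя мошчы сарака угоднікаў Божых.",
    "Іншы сказ.",
]
kept = filter_with_lookbehind(
    segments,
    is_valid=lambda s: "св." not in s.split(),
    lookbehind_tokens={"св."},
)
```

Comparing the whole last token rather than a raw string suffix avoids rejecting a segment just because the previous one ends in an unrelated word that happens to end with the same letters.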

Let's try to fix #11 first, and then we can see if this is still needed. I'll reopen if so.