Use best practice in English rules file
MichaelKohler opened this issue · 14 comments
Over the past years there were quite a few changes to the possible rule configurations. The English rules file still uses some older rules that should now be replaced by better approaches, such as replacing disallowed_symbols
with allowed_symbols
.
As most new rule files get copied from EN, this would improve overall quality quite substantially.
I agree, especially the documentation in the comments that exists in other files, like the German one, would be very useful.
As you said, I think as a start we should switch from disallowed symbols to allowed_symbols_regex, but we should also collect common English abbreviations, that we want to filter. (like "e.g.", "Mr." "Mrs." "Prof.",...). I found a few useful lists about this and I hope that I can write a proposal in the coming days.
Plus it might be good to take a native English speaker into the loop ;)
Initial PR: https://github.com/common-voice/cv-sentence-extractor/pull/159/files
@stefangrotz apart from the abbreviations, do you have any other ideas?
I would definitely add
matching_symbols = [
["„", "“"],
["(", ")"]
]
and maybe also even_symbols
Nice, that's already done :)
oh, my bad...
I found this explanation in the German file very useful and adapted it a little bit:
# Other patterns
# - Century and others at the beginning of the sentence (circumvents wrongly splitted sentences from WikiExtractor such as "During the 3. Century ..." leading to two incomplete sentences)
# - Sentence delimiter can only be at the end of a sentence
# - No words with only one letter (" a.", " a", " a ", "a ")
# - Mixed upper/lowercase in words (LaSi - mostly chemical elements?)
other_patterns = [
"^(Century|Other Example)",
"[\\.|\\?|!].+$",
"(\\s[A-Za-z]{1}[\\.|\\?|!]*$)|(^[A-Za-z]{1}\\s)|\\s[A-Za-z]{1}\\s",
"[a-z][A-Z][a-z]",
]
One letter words should be allowed in English, though. So we should delete this line or comment it out.
Thanks, I've incorporated half of it.
"^(Century|Other Example)"
.. is AFAICS not used in favor of 3rd etc. If you have others I'm happy to add those though.
(^[A-Za-z]{1}\s)
.. would discard all sentences starting with the article "a", such as "A brown fox jumps over the lazy dog", or any sentence starting with the personal pronon "I", such as "I went to school". Excluding A and I explicitely could make sense, but I'm really not sure if that's actually an issue in general.
(\s[A-Za-z]{1}[\.|\?|!]*$)
This would exclude sentences that end in "I", but that IMHO isn't a problem.
Generally, do you think it would make sense to think about possible abbreviation replacements through replacements
?
I believe that this could be a good thing, since sometimes a word only exists in the form of abbreviations. People rarely write things like "Mister", so replacing "Mr." could improve the dataset. I also believe that this could be extremely beneficial to small languages because replacing abbreviations could also increase the number of extracted sentences.
I've added the replacements. I've also not seen any instances of lowercase sentence starts, so I disallowed that as well.
Any other ideas? If not, I'd say we can do an official review and then create a new blocklist (I would do that outside of this PR though).
Let's do the blocklist in this PR as well, then we only need to verify it once. I'll play around with a few numbers.
/action blocklist en 80
The job failed during the step "generate blocklist", but I don't see why.
Took too long and GitHub killed it (there is a limit). I'm running it locally now.
@stefangrotz I did a review and the error rate is looking good. Can you have a look at #162 and tell me if you have any other suggestions?