common-voice/cv-sentence-extractor

Unable to filter out single letter and dot using abbreviation_patterns

comodoro opened this issue · 3 comments

The regular expressions I put in abbreviation_patterns to filter out sentences containing a single letter followed by a dot do not seem to have any effect. The ones I have tried are "\\b\\p{Latin}\\.", "\\b[^\\W]\\.", and "\\b[bcčdďeéěfghjlmnňpqrřštťwxyýžBCČDĎEĚFGHJLMNŇPQRŘŠTŤWXYÝŽ]\\b". The last one does not even catch the standalone enumerated letters (see the second sentence below).

Examples of sentences that make it into the result but shouldn't:
G. H. Bondy mu jde v ústrety.
C D G D. špatně.
Francis J. Mulberry.
Svými začátky sem náleží i O. Fischer.
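As a rough sanity check, two of the patterns above can be tried against the sample sentences outside of the extractor. This is only a sketch using Python's standard `re` module, which is an assumption on my part: the extractor itself is a Rust tool and its regex engine may behave differently. The `\p{Latin}` variant is left out because the standard `re` module does not support `\p{...}` classes. The names `PATTERNS`, `SENTENCES`, and `matching_patterns` are hypothetical.

```python
import re

# Patterns from the issue, with the TOML escaping ("\\b") reduced to the
# plain regex form. NOTE: these are evaluated by Python's `re`, not by the
# extractor, so this only shows what the expressions themselves can match.
PATTERNS = {
    "word_char_dot": re.compile(r"\b[^\W]\."),
    "letter_list": re.compile(
        r"\b[bcčdďeéěfghjlmnňpqrřštťwxyýžBCČDĎEĚFGHJLMNŇPQRŘŠTŤWXYÝŽ]\b"
    ),
}

# The four example sentences quoted in the issue.
SENTENCES = [
    "G. H. Bondy mu jde v ústrety.",
    "C D G D. špatně.",
    "Francis J. Mulberry.",
    "Svými začátky sem náleží i O. Fischer.",
]


def matching_patterns(sentence):
    """Return the names of all patterns that find a match in `sentence`."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(sentence)]


results = {sentence: matching_patterns(sentence) for sentence in SENTENCES}
for sentence, names in results.items():
    print(f"{sentence!r} -> {names}")
```

If a pattern matches here but the sentence still shows up in the output, the cause is more likely to be in how the extraction is run than in the expression itself.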

The whole rule file, renamed to txt for attaching here, is
cs.toml.txt

Sample file with these and other examples:
sample.txt

How are you running the extraction? In both the automated tests and a manual run of extract-file on your sample.txt, none of those sentences get accepted. Are you passing the -l cs argument correctly? Are you on the latest version of Sentence Extractor?

Sorry, it was very likely a case of appending to the same file. I was using

cargo run -- extract-file -l cs -d /mnt/d/shared/speech/language/all/ >> /mnt/d/shared/speech/language/sentences.txt

as per the README, and did not notice the >> or that the file kept growing, so sentences from earlier runs stayed in the output. I would swear that some of the regular expressions were there from the start, but apparently not. I have just rerun the whole thing (180MB) and the problem no longer seems to be present.
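To illustrate why appending hides the effect of a rule change, here is a minimal sketch that simulates two runs writing to the same output file. Opening the file in mode "a" stands in for the shell's >>, and mode "w" stands in for >. The helper names and the sentence "Dnes je hezky." are made up for the illustration; nothing here calls the extractor.

```python
import os
import tempfile


def write_run(path, sentences, mode):
    """Simulate one run redirected to `path`; "a" mimics >>, "w" mimics >."""
    with open(path, mode, encoding="utf-8") as handle:
        for sentence in sentences:
            handle.write(sentence + "\n")


def read_lines(path):
    """Read the output file back as a list of lines."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()


# Hypothetical outputs: the first run happened before the rule was added,
# the second run after it, so the abbreviation sentence is filtered out.
old_run = ["G. H. Bondy mu jde v ústrety.", "Dnes je hezky."]
new_run = ["Dnes je hezky."]

with tempfile.TemporaryDirectory() as tmp:
    appended = os.path.join(tmp, "appended.txt")
    write_run(appended, old_run, "a")
    write_run(appended, new_run, "a")  # stale lines from the first run remain
    appended_lines = read_lines(appended)

    truncated = os.path.join(tmp, "truncated.txt")
    write_run(truncated, old_run, "w")
    write_run(truncated, new_run, "w")  # file is emptied before writing
    truncated_lines = read_lines(truncated)

print(appended_lines)
print(truncated_lines)
```

With appending, the rejected sentence from the earlier run is still in the file, which looks exactly like the rule not working.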

Oh, I see. Might take a look at the README to see if appending really makes sense in that case. Happy you figured it out :)