Unable to filter out single letter and dot using abbreviation_patterns
comodoro opened this issue · 3 comments
Regular expressions in abbreviation_patterns intended to filter out sentences containing a single letter followed by a dot do not do that. The ones I have tried are "\\b\\p{Latin}\\.", "\\b[^\\W]\\.", and "\\b[bcčdďeéěfghjlmnňpqrřštťwxyýžBCČDĎEĚFGHJLMNŇPQRŘŠTŤWXYÝŽ]\\b" (this last one does not even work for the standalone enumerated letters, see the second sentence below).
Sentences that make it into the result but shouldn't are for example:
G. H. Bondy mu jde v ústrety.
C D G D. špatně.
Francis J. Mulberry.
Svými začátky sem náleží i O. Fischer.
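For reference, the intent of the second pattern can be checked outside the extractor. The following is only a rough sketch in Python, not the extractor's own matching code: Python's `re` module does not support `\p{Latin}`, so `[^\W\d_]` is used as an assumed stand-in for "one letter", and the helper name `has_single_letter_abbreviation` is made up for this example. The doubled backslashes in the patterns above are TOML string escapes, so the regex itself is `\b[^\W]\.`.

```python
import re

# Assumed stand-in for "a single letter followed by a dot".
# [^\W\d_] means "a word character that is neither a digit nor an underscore".
SINGLE_LETTER_DOT = re.compile(r"\b[^\W\d_]\.")


def has_single_letter_abbreviation(sentence: str) -> bool:
    """Return True if the sentence contains a lone letter directly followed by a dot."""
    return SINGLE_LETTER_DOT.search(sentence) is not None


samples = [
    "G. H. Bondy mu jde v ústrety.",
    "C D G D. špatně.",
    "Francis J. Mulberry.",
    "Svými začátky sem náleží i O. Fischer.",
]

for sentence in samples:
    # Each of the four reported sentences contains a lone letter plus dot.
    print(has_single_letter_abbreviation(sentence), sentence)
```

Note that a pattern of this shape also matches a sentence that simply ends in a one-letter word, so it is stricter than "abbreviation" suggests.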
The whole rule file, renamed to txt for attaching here, is cs.toml.txt.
Sample file with these and other examples: sample.txt
How are you running the extraction? In both the automated tests and a manual run of extract-file on your sample.txt, none of those sentences get accepted. Are you passing the -l cs argument correctly? Are you on the latest version of Sentence Extractor?
Sorry, it was very likely a case of appending to the same file. I was using
cargo run -- extract-file -l cs -d /mnt/d/shared/speech/language/all/ >> /mnt/d/shared/speech/language/sentences.txt
as per the README and did not notice the >> or the file growth. I would swear that some of the regular expressions were there from the start, but apparently not. I have just rerun the whole thing (180MB) and the problem no longer seems to be present.
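To illustrate why the stale sentences stayed around, here is a small sketch of append versus overwrite, written in Python rather than shell. File mode "a" behaves like the shell's >> and mode "w" like >. The helper names and the two example runs are made up for this illustration and are not taken from the extractor.

```python
import os
import tempfile


def write_results(path, sentences, mode):
    """Write one sentence per line; mode "a" mirrors shell >>, mode "w" mirrors shell >."""
    with open(path, mode, encoding="utf-8") as f:
        for sentence in sentences:
            f.write(sentence + "\n")


def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sentences.txt")
        # Hypothetical first run, before the abbreviation rule was added.
        first_run = ["G. H. Bondy mu jde v ústrety.", "Dnes je hezky."]
        # Hypothetical second run, with the rule in place.
        second_run = ["Dnes je hezky."]

        write_results(path, first_run, "w")
        write_results(path, second_run, "a")  # like >>: old output is kept
        appended = read_lines(path)

        write_results(path, second_run, "w")  # like >: old output is discarded
        overwritten = read_lines(path)
        return appended, overwritten


appended, overwritten = demo()
print(len(appended), len(overwritten))
```

With appending, the sentence rejected by the newer rules is still in the file from the earlier run, which matches what I was seeing.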
Oh, I see. I might take a look at the README to see whether appending really makes sense in that case. Happy you figured it out :)