common-voice/cv-sentence-extractor

Ignore roman numbers

Mte90 opened this issue · 10 comments

Mte90 commented

I started discussing this on cvtools (dabinat/cvtools#4), but I think it is better to have this feature in this tool, controlled by a parameter.

Roman numerals currently cause problems: they are not used as such in speech, they have to be handled differently depending on the language, and they add noise to the detection. This is probably something that DS could manage, but for now it is better not to have them in the sentences.

For the record, the corpora creator we run does already treat sentences with roman numerals as invalid: common-voice/CorporaCreator@7896c89

Any reason why this wouldn't be done in the rules file? As far as I can see from the discussion over at cvtools, we should not do this generically. And adding a new rule just for that doesn't make a lot of sense to me, as there are existing checks that can be used. Feel free to reopen if I completely misunderstood.

Mte90 commented

I think that having this feature in this tool instead of in the blacklist is simpler. Using the blacklist means adding a lot of new words for every language, and they are always the same ones: II, III, IV and so on.
If the scraper has an option to exclude them, it will simplify the procedure on cvtools and also for other languages.
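
To illustrate the blacklist route described above, here is a minimal Python sketch that generates the repeated entries instead of typing them by hand. The helper names `to_roman` and `roman_blacklist` are hypothetical and are not part of cv-sentence-extractor or cvtools; the output would still have to be copied into each language's blacklist, which is the overhead being pointed out.

```python
# Hypothetical helpers, not part of cv-sentence-extractor or cvtools.
ROMAN_PAIRS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(n: int) -> str:
    """Convert an integer in 1..3999 to its standard Roman numeral."""
    if not 1 <= n <= 3999:
        raise ValueError("n must be between 1 and 3999")
    parts = []
    for value, symbol in ROMAN_PAIRS:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def roman_blacklist(start: int = 2, stop: int = 30) -> list[str]:
    """Return the numerals for start..stop (inclusive) as blacklist entries."""
    return [to_roman(n) for n in range(start, stop + 1)]


if __name__ == "__main__":
    print(roman_blacklist(2, 5))
```

Starting at 2 by default is a deliberate assumption: a bare "I" is an ordinary word in some languages, so whether to blacklist it depends on the language.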

My argument was supposed to be "you can already do this, for example through abbreviation_patterns".

Mte90 commented

So basically we need to find a regex that can detect Roman numerals?

Mte90 commented

Maybe we can do some tests with https://regexr.com/3a406
It detects Roman numerals; if we can run some tests, that would be perfect.

Maybe also this one https://rgxdb.com/r/1HXJUYQR

Neither of these really works once you remove the ^ and $ anchors. Without them, the pattern matches the beginning of any word that starts with an uppercase letter that is also a valid Roman numeral.
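
The anchoring problem can be reproduced with a short Python sketch. The pattern below is a commonly used Roman numeral expression and is an assumption: it is not copied from either of the linked pages, and the function names are made up for illustration. It compares an unanchored search over a whole sentence with a whole-word check that applies the same pattern to each token.

```python
import re

# Assumed pattern (not taken from the linked pages): numerals 1..3999.
# The lookahead rejects the empty string, which the rest would otherwise match.
ROMAN = re.compile(
    r"(?=[MDCLXVI])M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
)


def unanchored_hits(sentence: str) -> list[str]:
    """Search anywhere in the sentence, as if ^ and $ had been removed."""
    return [m.group(0) for m in ROMAN.finditer(sentence)]


def whole_word_hits(sentence: str) -> list[str]:
    """Split into words first, then require each word to match completely."""
    return [w for w in re.findall(r"\w+", sentence) if ROMAN.fullmatch(w)]


if __name__ == "__main__":
    text = "Louis XIV met Maria in Milan."
    print(unanchored_hits(text))   # ['L', 'XIV', 'M', 'M']
    print(whole_word_hits(text))   # ['XIV']
    print(whole_word_hits("I read chapter IV"))  # ['I', 'IV']
```

The last call shows why a single generic rule is hard to get right: in English the pronoun "I" is also a valid numeral, so even the whole-word check needs language-specific handling.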

Mte90 commented

So it would be better if someone who can work on this finds the best regex, instead of me searching around the internet.

Given that there is not one way that works for all languages, I'm closing this as languages can do this via abbreviation_patterns if needed.