Ignore roman numbers
Mte90 opened this issue · 10 comments
I started discussing this in cvtools (dabinat/cvtools#4), but I think it is better to have this feature here, behind a parameter.
Roman numerals currently cause problems: they are not read as written when spoken, they need to be handled differently depending on the language, and they add noise to the detection. This is probably something that can be managed by DS, but for now it is better not to have them in the sentences.
For the record, the corpora creator we run does already treat sentences with roman numerals as invalid: common-voice/CorporaCreator@7896c89
Any reason why this wouldn't be done in the rules file? As far as I can see from the discussion over at cvtools, we should not do this generically. And adding a new rule just for that doesn't make a lot of sense to me, as there are existing checks that can be used. Feel free to reopen if I completely misunderstood.
I think having this feature in this tool instead of in the blacklist is simpler. Using the blacklist means adding a lot of new words for every language that are always the same, like II, III, IV and so on.
If the scraper has an option to exclude them, it will simplify the procedure for cvtools and also for other languages.
My argument was supposed to be "you can already do this, for example through abbreviation_patterns".
So basically we need to find a regex that can detect roman numbers?
Maybe we can do some tests with https://regexr.com/3a406
This detects roman numbers; if we can do some tests, that would be perfect.
Maybe also this one https://rgxdb.com/r/1HXJUYQR @MichaelKohler
> Maybe we can do some tests with https://regexr.com/3a406
> Maybe also this one https://rgxdb.com/r/1HXJUYQR
Both of these do not really work if you remove the ^ and $. That then matches any word starting with an uppercase letter that is also a valid roman literal.
So it is better if someone who can work on this finds the best regex, instead of me lurking on the internet.
Given that there is not one way that works for all languages, I'm closing this, as languages can do this via abbreviation_patterns if needed.