Ignore roman numbers

Question

Mte90 opened this issue 5 years ago · 10 comments

I started discussing this in cvtools (dabinat/cvtools#4), but I think it is better to have this feature here, behind a parameter.

Roman numerals currently cause problems: they are not used in spoken language, they need to be handled differently depending on the language, and they add noise to the detection. This is probably something that DS can manage, but for now it is better not to have them in the sentences.

Answer 1 · 2019-12-30T15:12:24.000Z

For the record, the corpora creator we run does already treat sentences with roman numerals as invalid: common-voice/CorporaCreator@7896c89

Answer 2 · 2020-01-03T17:44:35.000Z

Is there any reason why this wouldn't be done in the rules file? As far as I can see from the discussion over at cvtools, we should not do this generically. Adding a new rule just for that doesn't make a lot of sense to me, as there are existing checks that can be used. Feel free to reopen if I completely misunderstood.

Answer 3 · 2020-01-04T19:41:16.000Z

I think having this feature in this tool instead of in the blacklist is simpler. Using the blacklist means adding a lot of new words for every language that are always the same, like II, III, IV and so on.
If the scraper had an option to exclude them, it would simplify the procedure in cvtools and also for other languages.

Answer 4 · 2020-01-04T20:09:47.000Z

My argument was supposed to be "you can already do this, for example through abbreviation_patterns".

Answer 5 · 2020-01-06T15:07:56.000Z

So basically we need to find a regex that can detect Roman numerals?

Answer 6 · 2020-02-04T21:18:49.000Z

Maybe we can do some tests with https://regexr.com/3a406
It detects Roman numerals; it would be perfect if we could run some tests.

Answer 7 · 2020-03-03T14:16:45.000Z

Maybe also this one https://rgxdb.com/r/1HXJUYQR @MichaelKohler

Answer 8 · 2020-03-04T17:58:51.000Z

Maybe we can do some tests with https://regexr.com/3a406
Maybe also this one https://rgxdb.com/r/1HXJUYQR

Neither of these really works if you remove the ^ and $: the pattern then matches any word starting with an uppercase letter that is also a valid Roman literal.
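
To make the anchoring problem concrete, here is a minimal Python sketch. The pattern is the commonly circulated strict Roman numeral regex, which is assumed (not confirmed) to be close to the two linked above; the function names are made up for illustration. It compares searching the pattern unanchored inside a sentence with requiring a whole token to match.

```python
import re

# Commonly used strict Roman numeral pattern (1 to 3999), without ^ and $.
# Assumption: the linked regexes are similar to this one.
ROMAN_CORE = r"M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
ROMAN = re.compile(ROMAN_CORE)


def unanchored_hits(sentence):
    """Non-empty substrings matched when the pattern is searched without anchors."""
    return [m.group(0) for m in ROMAN.finditer(sentence) if m.group(0)]


def whole_token_hits(sentence):
    """Tokens that are valid Roman numerals from their first to their last character."""
    return [tok for tok in re.findall(r"\w+", sentence) if ROMAN.fullmatch(tok)]


print(unanchored_hits("Louis XIV met Marie in Milan."))   # ['L', 'XIV', 'M', 'M']
print(whole_token_hits("Louis XIV met Marie in Milan."))  # ['XIV']
print(whole_token_hits("I saw Louis XIV."))               # ['I', 'XIV']
```

Matching whole tokens avoids flagging the capital letters of "Louis", "Marie" and "Milan", but the English pronoun "I" is still reported, so a generic check would still produce false positives in some languages.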

Answer 9 · 2020-03-05T11:35:24.000Z

So it would be better if someone who can work on this finds the best regex, instead of me lurking on the internet.

Answer 10 · 2021-10-23T15:04:14.000Z

Given that there is not one way that works for all languages, I'm closing this as languages can do this via abbreviation_patterns if needed.
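
As a rough illustration of why one check does not fit every language, the sketch below applies the same Roman numeral pattern with a per-language set of tokens to let through. It is not how abbreviation_patterns is implemented; `ALLOWED_TOKENS`, the language codes and `has_roman_numeral` are hypothetical names chosen for this example.

```python
import re

# Strict Roman numeral pattern (1 to 3999), applied to whole tokens only.
ROMAN = r"M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"

# Hypothetical per-language exceptions: tokens that look like numerals
# but are ordinary words in that language.
ALLOWED_TOKENS = {
    "en": {"I"},   # English pronoun
    "de": set(),   # assumption for the example: no exceptions
}


def has_roman_numeral(sentence, lang):
    """True if any token is a Roman numeral that is not an allowed word for lang."""
    allowed = ALLOWED_TOKENS.get(lang, set())
    return any(
        re.fullmatch(ROMAN, tok) is not None and tok not in allowed
        for tok in re.findall(r"\w+", sentence)
    )


print(has_roman_numeral("I agree.", "en"))          # False
print(has_roman_numeral("I met Louis XIV.", "en"))  # True
print(has_roman_numeral("I agree.", "de"))          # True
```

The same sentence is accepted or rejected depending on the language's exception set, which is the kind of decision that a per-language rules file is better placed to make than a global option.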