common-voice/cv-sentence-extractor

Unify similar symbols

iLeonidze opened this issue · 2 comments

Currently there is a problem: Russian sentences contain many characters that look alike but are actually different code points. It would be very nice to have the ability to convert multiple similar symbols into a single one.
For example, we have these symbols:

  • - U+002D : HYPHEN-MINUS {hyphen or minus sign}
  • U+2010 : HYPHEN
  • U+2011 : NON-BREAKING HYPHEN
  • U+2013 : EN DASH
  • U+2014 : EM DASH
  • U+2015 : HORIZONTAL BAR {quotation dash}
  • U+2212 : MINUS SIGN

which should all be converted to a single - U+002D : HYPHEN-MINUS {hyphen or minus sign}
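
As a minimal sketch of the requested conversion, the mapping could look roughly like the following in Python. The names `DASH_LIKE`, `DASH_TABLE` and `unify_dashes` are hypothetical and only for illustration; they are not part of cv-sentence-extractor, and the actual rule format would depend on how #9 ends up being implemented.

```python
# Hypothetical illustration only: not the extractor's actual API.
# Code points taken from the list above; all of them map to U+002D.
DASH_LIKE = (
    "\u2010"  # HYPHEN
    "\u2011"  # NON-BREAKING HYPHEN
    "\u2013"  # EN DASH
    "\u2014"  # EM DASH
    "\u2015"  # HORIZONTAL BAR
    "\u2212"  # MINUS SIGN
)

# Translation table from each dash-like character to HYPHEN-MINUS.
DASH_TABLE = str.maketrans({ch: "-" for ch in DASH_LIKE})


def unify_dashes(sentence: str) -> str:
    """Replace every dash-like character in the sentence with U+002D."""
    return sentence.translate(DASH_TABLE)
```

A table-based replacement like this keeps the rule purely declarative, so the same idea should carry over to a per-language configuration file if that is the direction chosen.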

If we implement #9 in a generic way, I think this could be done through that as well. Do you agree?

Yeah, it seems it will be possible to add a conversion rule via that feature.