common-voice/cv-sentence-extractor

Replacements not considering whitespace

HarikalarKutusu opened this issue · 6 comments

I have several abbreviation entries in the replacement array, such as:

  [" sf.", " sayfa"],
  [" dk.", " dakika"],
  [" ör.", " örnek"],

I put a space in front of them to prevent substring replacement, but somehow it did not prevent that. Looking at the code, which does a simple global string replace (and the rules are not pre-processed anywhere), I could not pinpoint the reason.

Here is a sample original text part:

..30 mermi kapasiteli yeni bir şarjör. 30 mermilik bu şarjörü...

And this is what it becomes:

..30 mermi kapasiteli yeni bir şarjörnek 30 mermilik bu şarjörü...

There are many samples of similar replacements; I recognized them while checking non-dictionary words... Very weird...
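For reference, Rust's standard-library `str::replace` does plain substring matching with no notion of word boundaries. A minimal sketch (with the sample sentence hard-coded) shows that the output above is exactly what a pattern *without* the leading space would produce, while the space-prefixed pattern from the rules leaves the text untouched:

```rust
fn main() {
    let text = "30 mermi kapasiteli yeni bir şarjör. 30 mermilik bu şarjörü";

    // Without a leading space, "ör." matches *inside* "şarjör." and
    // produces exactly the garbled output reported above.
    let garbled = text.replace("ör.", "örnek");
    assert_eq!(
        garbled,
        "30 mermi kapasiteli yeni bir şarjörnek 30 mermilik bu şarjörü"
    );

    // With the leading space, as in the rules file, nothing matches
    // and the text is left untouched.
    let untouched = text.replace(" ör.", " örnek");
    assert_eq!(untouched, text);
}
```

So the garbling pattern is consistent with the space somehow not making it into the rule that actually ran.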

I'm sure the Rust people have it covered :) Probably, in my case, other replacement items interfere (not that I could find one)... It shouldn't be the NLTK code though; that did not run at that phase.

I'm at the final stage of blacklisting, after that, I'll dive deeper into this.

Yeah, as this is standard Rust behavior (probably a nice exercise on the side to figure out why that is the behavior), I think we'd need to handle this on our side. I can definitely see why we'd want to allow whitespace in the pattern, so it does not interfere with occurrences where it should not be replaced. Potential approaches to fix this:

  • is there a way to replace only on word boundaries? Is this what we'd eventually want?
  • should replacements additionally support Regex and therefore we could handle these cases with Regex?

I'll leave this open as an improvement; I think this would come in useful.
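On the first option: a hand-rolled, std-only sketch of what boundary-aware replacement could look like (the function name is hypothetical, and it checks only the character *before* the match, a simplification that suits abbreviation patterns ending in "."):

```rust
/// Replace `pat` by `rep`, but only where the match does not start in
/// the middle of a word, i.e. the preceding character (if any) is not
/// alphanumeric. Only the leading side is checked here; abbreviations
/// already end in ".", so the trailing side needs no check.
fn replace_at_word_start(text: &str, pat: &str, rep: &str) -> String {
    let mut out = String::new();
    let mut rest = text;
    let mut consumed = 0; // bytes of `text` already scanned
    while let Some(i) = rest.find(pat) {
        let abs = consumed + i;
        let boundary = text[..abs]
            .chars()
            .next_back()
            .map_or(true, |c| !c.is_alphanumeric());
        out.push_str(&rest[..i]);
        out.push_str(if boundary { rep } else { pat });
        rest = &rest[i + pat.len()..];
        consumed = abs + pat.len();
    }
    out.push_str(rest);
    out
}

fn main() {
    // "şarjör." is left alone; the standalone abbreviation is expanded.
    let s = replace_at_word_start("şarjör. 30 ör. mermi", "ör.", "örnek");
    assert_eq!(s, "şarjör. 30 örnek mermi");
}
```

A real implementation would more likely lean on a regex engine's Unicode-aware `\b`, but the sketch shows the idea without pulling in a dependency.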

is there a way to replace only on word boundaries? Is this what we'd eventually want?

Not necessarily. In my scripts for language models, I had pre-, mid-, and post-replacement needs:

pre & post: mostly cleanup, correction & normalization - not words, full strings.
mid: one- or multi-word replacements, e.g.:

  • "21. yy." => "yirmi birinci yüzyıl" ("twenty-first century", to expand century abbreviations)
  • number => string conversion (or back)
  • spelling corrections.

These, with all apostrophe/suffix possibilities!
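A mid-replacement pass like the one described could be sketched as an ordered table of pairs applied in sequence; the entries below are illustrative, taken from the examples in this thread:

```rust
fn main() {
    // Illustrative mid-replacement table; order matters, since later
    // rules see the output of earlier ones.
    let rules: &[(&str, &str)] = &[
        ("21. yy.", "yirmi birinci yüzyıl"), // century abbreviation
        ("aceba", "acaba"),                  // old spelling -> modern form
    ];

    let mut s = String::from("21. yy. başında aceba ne değişti?");
    for &(pat, rep) in rules {
        s = s.replace(pat, rep);
    }
    assert_eq!(s, "yirmi birinci yüzyıl başında acaba ne değişti?");
}
```

Of course, this still does plain substring matching, so each entry has to be written carefully (or combined with a boundary check) to avoid the "şarjörnek" class of accidents.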

A second problem here: what is a word boundary that is valid for all languages? Think of apostrophes or abbreviations, for example.
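One pragmatic answer, if boundary-aware replacement were ever implemented: treat any Unicode alphanumeric character as part of a word. Rust's std exposes this directly, and it covers letters like "ö" or "â", though apostrophes would still count as boundaries, which may or may not be what a given language wants:

```rust
fn main() {
    // Unicode-aware checks: Turkish/accented letters count as word
    // characters, unlike an ASCII-only [A-Za-z0-9] test would assume.
    assert!('ö'.is_alphanumeric());
    assert!('â'.is_alphanumeric());
    // An apostrophe does not, so "İstanbul'da" would split at the
    // apostrophe under this definition of a boundary.
    assert!(!'\''.is_alphanumeric());
}
```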

And the related rules (e.g. for apostrophes, or the "a" vs "â" usage that changed over time), as well as the source quality behind them, are also questionable (i.e. are the authors & editors of those Wiki articles good enough?). Half of my blacklist is spelling errors.

One could use word replacements for old texts, for example (e.g. an author from 75 years ago writes "aceba", which is "acaba" in its modern form and can be handled), but Wiki is definitely chaotic.

should replacements additionally support Regex and therefore we could handle these cases with Regex?

I think this would be a good addition. On the other hand, regexes are risky to run on a full-scale resource.

The whole replace/regex business is error-prone, and every change should be checked & tested extensively. E.g. I encountered the problem mentioned above while checking words for blacklisting - some weird ones ending with "...örnek" or "...sayfa".

This did not happen in further tests; it was probably a mistake in my regex formats (they changed a lot over time).
You may like to close this - unless you want to keep it open for the regex option, which would be a great addition if someone tests it carefully.

I'm closing this for now, as it seems that the playground example now also shows the correct behavior.