common-voice/cv-sentence-extractor

max_characters in rules

HarikalarKutusu opened this issue · 3 comments

I'm trying to make this work with Turkish.
I see the following:

min_trimmed_length = 
min_word_count = 
max_word_count = 
min_characters = 

If I'm not mistaken, there is no max_characters setting. As in German, word length in Turkish varies widely; Turkish is agglutinative, so even a 5-6 word sentence can be quite long to read.

I've also been planning to change the sentence-collector rules to use character length instead of word count, but I see that the setting is missing here.

If this is true, can this be added?

That sounds reasonable to implement. I'd suggest implementing it the same way as min_characters, including a test and documentation in the README.

Thanks for bringing this up.
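
To illustrate the request, here is a minimal sketch in Python of how an upper character limit could sit next to the existing lower one. It is not the extractor's own code: the `Rules` class, the `check` function, the "0 means no limit" convention, and counting every character of the trimmed sentence are all assumptions made for this example. Only the setting names come from the discussion above.

```python
from dataclasses import dataclass


@dataclass
class Rules:
    # Hypothetical container for the settings named in this issue.
    # A value of 0 is treated as "no limit" (an assumption of this sketch).
    min_characters: int = 0
    max_characters: int = 0  # the setting proposed in this issue
    min_word_count: int = 0
    max_word_count: int = 0


def check(sentence: str, rules: Rules) -> bool:
    """Return True if the sentence satisfies every configured limit."""
    trimmed = sentence.strip()
    chars = len(trimmed)           # counts all characters, spaces included
    words = len(trimmed.split())   # whitespace-separated words

    if chars < rules.min_characters:
        return False
    if rules.max_characters and chars > rules.max_characters:
        return False
    if words < rules.min_word_count:
        return False
    if rules.max_word_count and words > rules.max_word_count:
        return False
    return True


# A two-word Turkish sentence that easily passes a word limit
# but is long because of agglutination.
long_sentence = "Çekoslovakyalılaştıramadıklarımızdan mısınız?"

word_only = Rules(min_word_count=1, max_word_count=14)
with_chars = Rules(min_word_count=1, max_word_count=14, max_characters=30)

print(check(long_sentence, word_only))   # word limits alone accept it
print(check(long_sentence, with_chars))  # the character limit rejects it
```

The point of the sketch is that the word-count limits alone accept the sentence, while the added character limit rejects it, which is the behaviour asked for here.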

OK, I'll do it tomorrow on a clean clone (I've been messing with the current one)...

Implemented in #183