common-voice/cv-sentence-extractor

Punkt Issue for Indian languages

arijitx opened this issue · 3 comments

https://github.com/Common-Voice/cv-sentence-extractor/blob/master/src/extractor.rs#L127

The Punkt tokenizer falls back to the English tokenizer for Indian locales. As a result, it only splits sentences that end with “.”, so extremely large parts of articles are picked up as single sentences. Is there a way to pass a sentence separator?

See also #11. Right now there is no config you can pass, but I've been thinking about that a bit in the past few weeks. I might come up with a proof of concept soon.

While splitting on a rule-defined set of terminators might work, I feel this is too naive and could lead to other issues. For example, languages that use the same character both as a sentence terminator and inside abbreviations would be split in the wrong places more often than is manageable. Let's spend more time on #11 to make it work nicely.
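To make the trade-off concrete, here is a minimal sketch of what naive terminator-based splitting could look like. It is not code from cv-sentence-extractor: the function name `split_by_terminators` and the terminator list are hypothetical, and the danda (`।`, U+0964) is assumed as the sentence terminator for Hindi or Bengali text.

```rust
/// Hypothetical helper, not part of cv-sentence-extractor.
/// Splits `text` after every character listed in `terminators`.
fn split_by_terminators(text: &str, terminators: &[char]) -> Vec<String> {
    let mut sentences = Vec::new();
    let mut current = String::new();

    for ch in text.chars() {
        current.push(ch);
        if terminators.contains(&ch) {
            // A terminator closes the current sentence.
            let trimmed = current.trim();
            if !trimmed.is_empty() {
                sentences.push(trimmed.to_string());
            }
            current.clear();
        }
    }

    // Keep trailing text that has no terminator.
    let rest = current.trim();
    if !rest.is_empty() {
        sentences.push(rest.to_string());
    }
    sentences
}

fn main() {
    // Assumed terminator set; a real config would be defined per language.
    let terminators = ['।', '.', '?', '!'];

    // Hindi text ending in a danda is split as intended.
    for s in split_by_terminators("मैं घर जा रहा हूँ। तुम कहाँ हो?", &terminators) {
        println!("{}", s);
    }

    // An abbreviation that shares the terminator character is split wrongly.
    for s in split_by_terminators("Dr. Roy arrived. He sat down.", &terminators) {
        println!("{}", s);
    }
}
```

The second input shows the weakness described above: the text comes back as three pieces ("Dr.", "Roy arrived.", "He sat down.") because the period after the abbreviation is treated as a sentence end.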

Any resolution on this? I can see that the sentence extractor code still depends on Punkt for tokenization :(