optimaize/language-detector

Feature to remove minority script content

Closed this issue · 0 comments

When a text is written largely in one script (e.g. Cyrillic) but still contains some content in another script (e.g. Latin), remove the minority-script content as noise.

Make the limit configurable: the percentage of the text below which a script counts as a minority.

(Previously, it was only possible to remove ASCII, which is a subset of Latin.)
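To make the intended behaviour concrete, here is a minimal sketch of such a filter. It is not the library's actual implementation: the class name `MinorityScriptFilter`, the method `removeMinorityScripts`, and the `minorityPercent` parameter are hypothetical names chosen for illustration. It also assumes that shares are computed over letters only, and that non-letters (whitespace, digits, punctuation) are always kept.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the library's API.
final class MinorityScriptFilter {

    private MinorityScriptFilter() {}

    /**
     * Removes the letters of every script whose share of all letters in the
     * text is below minorityPercent (0 to 100). Non-letters are kept as-is.
     */
    static String removeMinorityScripts(CharSequence text, double minorityPercent) {
        // First pass: count letters per Unicode script.
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        int totalLetters = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            if (Character.isLetter(cp)) {
                counts.merge(UnicodeScript.of(cp), 1, Integer::sum);
                totalLetters++;
            }
            i += Character.charCount(cp);
        }
        if (totalLetters == 0) {
            return text.toString();
        }

        // Second pass: drop letters that belong to a minority script.
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            boolean keep = true;
            if (Character.isLetter(cp)) {
                double sharePercent =
                        100.0 * counts.get(UnicodeScript.of(cp)) / totalLetters;
                keep = sharePercent >= minorityPercent;
            }
            if (keep) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

For example, in the text "Да да да ok" the Latin letters make up 2 of 8 letters (25%). With a limit of 30 the "ok" is removed (the preceding space stays), while with a limit of 20 the text is returned unchanged.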