ageitgey/node-unfluff

Why doesn't this support Chinese? What's the difficult part?

Pana opened this issue · 1 comment

Pana commented

FYI

Hi @Pana,

The basic algorithm for finding good text is:

  1. Split text into words
  2. Count total number of words
  3. Count the number of words that are "stop words" (filler words like "the", "and", "or", etc., that occur in real writing)
  4. If the ratio between stop words and total words is good, this is probably useful text so keep it. Otherwise discard it.
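
As a rough illustration of those four steps, here is a minimal sketch in JavaScript. It is not the library's actual code: the names `STOP_WORDS`, `MIN_STOP_WORD_RATIO`, `splitIntoWords`, `stopWordRatio`, and `looksLikeUsefulText` are made up for this example, the stop-word list is deliberately tiny, and the threshold is an assumed value since nothing above says what ratio counts as "good".

```javascript
// Hypothetical, very small stop-word list, for illustration only.
const STOP_WORDS = new Set(["the", "and", "or", "a", "of", "to", "in", "is", "it", "that"]);

// Assumed threshold; the real cut-off is not stated in this thread.
const MIN_STOP_WORD_RATIO = 0.2;

// Step 1: split text into words wherever there is whitespace.
// Punctuation attached to a word (e.g. "the,") is not stripped in this sketch.
function splitIntoWords(text) {
  return text.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
}

// Steps 2 and 3: count all words and the stop words among them,
// then return stop words / total words (0 for empty text).
function stopWordRatio(text) {
  const words = splitIntoWords(text);
  if (words.length === 0) return 0;
  const stopCount = words.filter((w) => STOP_WORDS.has(w)).length;
  return stopCount / words.length;
}

// Step 4: keep the text only if the ratio is high enough.
function looksLikeUsefulText(text) {
  return stopWordRatio(text) >= MIN_STOP_WORD_RATIO;
}

console.log(looksLikeUsefulText("the cat and the dog")); // true (3 of 5 words are stop words)
console.log(looksLikeUsefulText("Home Login Register Cart")); // false (no stop words)
```

Real sentences tend to contain many filler words, while navigation menus and link lists contain almost none, which is why the ratio works as a filter.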

Step 1 is implemented very simply: it just splits the text wherever there are spaces between words. For Chinese, that doesn't work at all, since Chinese isn't written with spaces between words. Someone would have to implement a way to split Chinese text into words for this to work.
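
To make the problem concrete, here is a small sketch comparing a whitespace split with one possible way to segment Chinese. It uses the standard `Intl.Segmenter` API, which is an assumption on my part: it needs a reasonably recent Node.js build with ICU data, and it is not something the library does today. The helper names `splitOnSpaces` and `segmentWords` are hypothetical.

```javascript
// Whitespace split, as described in step 1 above.
function splitOnSpaces(text) {
  return text.split(/\s+/).filter((w) => w.length > 0);
}

// One possible alternative: let the runtime's ICU data find word boundaries.
// Segments that are not word-like (punctuation, spaces) are skipped.
function segmentWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const part of segmenter.segment(text)) {
    if (part.isWordLike) words.push(part.segment);
  }
  return words;
}

const sample = "我喜欢阅读中文文章";

// The whole sentence comes back as a single "word".
console.log(splitOnSpaces(sample).length); // 1

// The exact segments depend on the ICU dictionary shipped with the runtime.
console.log(segmentWords(sample, "zh"));
```

Even with a working word splitter, the stop-word step would still need a list of Chinese stop words before the ratio check could mean anything.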