Why doesn't this support Chinese? What's the difficult part?
Pana opened this issue · 1 comments
Pana commented
FYI
ageitgey commented
Hi @Pana,
The basic algorithm for finding good text is:
1. Split the text into words.
2. Count the total number of words.
3. Count the number of words that are "stop words" (filler words like "the", "and", "or", etc., that occur in real writing).
4. If the ratio of stop words to total words is good, this is probably useful text, so keep it. Otherwise, discard it.
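As a rough illustration of the steps above, here is a minimal Python sketch. It is not the project's actual code: the stop-word set, the `min_ratio` threshold of 0.2, and the function names are all assumptions made up for this example.

```python
# Hypothetical, tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "or", "a", "of", "to", "in", "is"}


def stop_word_ratio(text, stop_words=STOP_WORDS):
    """Return the fraction of whitespace-separated words that are stop words."""
    # Split on whitespace and count the words.
    words = text.lower().split()
    if not words:
        return 0.0
    # Count the stop words, ignoring punctuation attached to a word.
    stop_count = sum(1 for w in words if w.strip('.,;:!?"\'()') in stop_words)
    return stop_count / len(words)


def looks_like_real_text(text, min_ratio=0.2):
    """Keep the text only if the stop-word ratio reaches the (assumed) threshold."""
    return stop_word_ratio(text) >= min_ratio
```

With these assumptions, a sentence like "The cat sat in the garden and looked to the sky" passes the check, while a navigation-style string like "Home Products Contact Login Register" contains no stop words and is discarded.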
Step 1 is implemented very simply: it just splits the text wherever there are spaces between words. For Chinese, that doesn't work at all, since Chinese is written without spaces between words. Someone would have to implement a way to split Chinese text into words for this to work.
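To make the problem concrete, here is a small Python demo of whitespace splitting on an English sentence versus a Chinese one. The helper name `split_on_whitespace` and the sample sentences are made up for illustration and are not taken from the project.

```python
def split_on_whitespace(text):
    """Split text into 'words' wherever whitespace occurs."""
    return text.split()


english = "the cat sat on the mat"
chinese = "我喜欢在周末读书"  # "I like to read books on weekends"

print(len(split_on_whitespace(english)))  # 6
print(len(split_on_whitespace(chinese)))  # 1
```

The whole Chinese sentence comes back as a single "word", so it never matches anything in a stop-word list and the ratio check has nothing to work with. Supporting Chinese would mean swapping this splitting step for a real word segmenter.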