ttezel/bayes

How well will this handle Chinese?

benjiwheeler opened this issue · 1 comment

I know that Chinese isn't written with spaces between words the way English and most other languages are; a Chinese character is more analogous to an English word than to an English letter.

Would you expect your classifier to treat Chinese characters as letters, or as words?

Depends on your tokenizer.
By default it will tokenize Chinese characters as letters, but you can easily change that by passing your own tokenizer:

var bayes = require('bayes')

// strip whitespace, then split into individual characters
var classifier = bayes({
    tokenizer: function (text) { return text.replace(/\s/g, '').split('') }
})
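
For instance, a quick sketch of training and classifying with that tokenizer (the Chinese strings here are just made-up examples; learn and categorize are the classifier's standard calls from the README):

// each character of the training text becomes its own token
classifier.learn('我很高兴', 'positive')  // "I am very happy"
classifier.learn('我很生气', 'negative')  // "I am very angry"

console.log(classifier.categorize('今天很高兴'))
// => 'positive' (高 and 兴 were only seen in the positive example)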