How well will this handle Chinese?
benjiwheeler opened this issue · 1 comment
benjiwheeler commented
I know that Chinese is not written with spaces between words the way English and most other languages are; a Chinese character is more analogous to an English word than to an English letter.
Would you expect your classifier to treat Chinese characters as letters, or as words?
toonimoadi commented
Depends on your tokenizer.
By default it will tokenize Chinese characters as letters, but you can easily override that by passing your own tokenizer:

```javascript
bayes({
  tokenizer: function (text) { return text.replace(/\s/g, '').split('') }
})
```
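To see what that tokenizer actually feeds the classifier, here is a self-contained sketch of the same function (the sample sentence is just an illustration): it strips all whitespace and then splits the remaining string into individual characters, so each Chinese character becomes its own token.

```javascript
// Character-level tokenizer: remove all whitespace, then split the
// string into individual characters, one token per character.
function tokenizer(text) {
  return text.replace(/\s/g, '').split('');
}

console.log(tokenizer('我 喜欢 猫'));
// ['我', '喜', '欢', '猫']
```

If you want word-level tokens instead, you would plug in a proper Chinese word segmenter in place of `split('')`; the classifier itself does not care what the tokens are, only that the tokenizer returns an array of strings.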