ttezel/bayes

How well will this handle Chinese?

benjiwheeler opened this issue · 1 comment

I know that Chinese isn't written with spaces between words the way English and most other languages are; a Chinese character is more analogous to an English word than to an English letter.

Would you expect your classifier to treat Chinese characters as letters, or as words?

Depends on your tokenizer.
By default it will tokenize Chinese characters as letters, but you can easily change that by passing your own tokenizer:

var bayes = require('bayes')

// strip whitespace, then split into individual characters
var classifier = bayes({
    tokenizer: function (text) { return text.replace(/\s/g, '').split('') }
})
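
For instance, a quick sketch of training and classifying with that tokenizer (the Chinese strings here are just made-up examples; learn and categorize are the classifier's standard calls from the README):

// each character of the training text becomes its own token
classifier.learn('我很高兴', 'positive')  // "I am very happy"
classifier.learn('我很生气', 'negative')  // "I am very angry"

console.log(classifier.categorize('今天很高兴'))
// => 'positive' (高 and 兴 were only seen in the positive example)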