abcd漢字 fails to tokenize

Question

DarrenCook opened this issue 8 years ago · 0 comments

With the input "abcd漢字" I get 6 tokens (one per character), with no POS info. With "abc漢字", which is only one letter shorter, I get 2 tokens, with POS values.

Curiously, when I try them at http://rakuten-nlp.github.io/rakutenma/, both strings work.

My (node.js) code is like this, using the default model_ja.json file that comes with the tokenizer.

const fs = require('fs');
const RakutenMA = require('rakutenma');

// Load the default Japanese model that comes with the tokenizer.
const model = JSON.parse(fs.readFileSync("model_ja.json", "utf8"));
const rma = new RakutenMA(model);
rma.featset = RakutenMA.default_featset_ja;
rma.hash_func = RakutenMA.create_hash_func(15); // The 15 is to match the pre-trained data.

const s = "abcd漢字";
console.log(rma.tokenize(s)); // Print the result so the tokens can be inspected.
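
To compare the two inputs side by side, a small helper along these lines might be useful. It is only a sketch: `summarizeTokenization` is a hypothetical name, not part of rakutenma, and it assumes that `tokenize` returns an array of `[surface, pos]` pairs and that a missing POS shows up as an empty or undefined second element.

```javascript
// Hypothetical helper (not part of rakutenma): summarizes what a tokenizer
// returns for each input so that differing results are easy to spot.
// Assumption: tokenize(s) returns an array of [surface, pos] pairs.
function summarizeTokenization(tokenize, inputs) {
  return inputs.map((input) => {
    const tokens = tokenize(input);
    return {
      input,
      tokenCount: tokens.length,
      // true only if there is at least one token and every token
      // carries a non-empty POS value in its second slot
      allHavePos:
        tokens.length > 0 &&
        tokens.every((t) => Array.isArray(t) && Boolean(t[1])),
      tokens,
    };
  });
}
```

With the `rma` instance created above, it could be called as `console.log(summarizeTokenization((s) => rma.tokenize(s), ["abc漢字", "abcd漢字"]));`. Passing the tokenizer in as a function keeps the helper independent of how the model is loaded.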

Is there anything I am missing, or doing wrong, here?