skeskinen/bert.cpp

subword `##` should be an option.

FFengIll opened this issue · 6 comments

For BERT, many models use `##` as the subword symbol, but not all.
Some popular BERT-based models define their own subword symbol.

For example, in e5 the symbol is `▁`.

>>> a = '▁'
>>> a.encode('utf-8')
b'\xe2\x96\x81'

Furthermore, there is no rule that forces the use of `##`.

In the model, the subword symbol is usually called `replacement` or `continuing_subword_prefix`.
It actually shows up in tokenizer.json.
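
For illustration, here is a rough sketch of how that symbol could be read out of tokenizer.json. This is not bert.cpp code; the field names (`continuing_subword_prefix`, `replacement`) are assumptions based on the Hugging Face tokenizers format, and other layouts would need extra handling.

import json

# Sketch: look up the subword marker in a Hugging Face tokenizer.json.
def find_subword_symbol(path="tokenizer.json"):
    with open(path, encoding="utf-8") as f:
        tok = json.load(f)

    model = tok.get("model", {})
    # WordPiece models (classic BERT) store the "##" prefix here.
    if "continuing_subword_prefix" in model:
        return model["continuing_subword_prefix"]

    # SentencePiece-style tokenizers (e.g. e5) mark word starts with a
    # "replacement" character such as "▁" in the Metaspace pre_tokenizer/decoder.
    for key in ("pre_tokenizer", "decoder"):
        section = tok.get(key) or {}
        if isinstance(section, dict) and "replacement" in section:
            return section["replacement"]
    return None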

Hi, I was also wondering about the subword rules with regard to #31.
I remember trying to get the tokens from the tokenizer, like you did in the PR,
but I also remember having some issues with the subwords when I tried to do that.

Does the code in #31 handle subwords?
Do you have an idea on how to handle models like e5?

Also, an unrelated thought I had earlier: it would be nice to convert test_tokenizer.cpp to Python and run the tests against the reference tokenizers.
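
A rough idea of what such a Python test could look like (just a sketch: it assumes transformers is installed, and get_bertcpp_ids is a hypothetical hook that returns the ids produced by bert.cpp for a sentence, e.g. by reading the output of the C++ test binary):

from transformers import AutoTokenizer

# Sketch: compare bert.cpp token ids against the reference HF tokenizer.
def check_against_reference(model_name, sentences, get_bertcpp_ids):
    ref = AutoTokenizer.from_pretrained(model_name)
    for text in sentences:
        expected = ref.encode(text)
        actual = get_bertcpp_ids(text)  # hypothetical bert.cpp hook
        assert actual == expected, f"mismatch on {text!r}: {actual} != {expected}"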

@skeskinen No, #31 only makes the vocab file optional (because it may be missing).

This issue is a separate problem about subwords (I found it because I hit too many unknown tokens when using e5).

Below are some token samples from BERT-based models.

In m3e, the subword prefix is `##`, like in many BERT models.

"##a": 8139,
"03": 8140,
"09": 8141,
"08": 8142,
"28": 8143,
"##2": 8144,

In e5, the subword symbol is `▁`, since they trained a new tokenizer (below is an excerpt from tokenizer.json).

      [
        "▁si",
        -7.355116367340088
      ],
      [
        "▁ja",
        -7.370460510253906
      ],
      [
        "▁za",
        -7.37307596206665
      ],
      [
        "▁v",
        -7.385393142700195
      ],
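
To make the difference concrete: `##` marks a continuation of the previous piece, while `▁` marks the start of a new word, so joining tokens back into text works differently. A minimal sketch (the token lists are made up for illustration):

# Sketch: joining tokens back into text under the two conventions.
def join_wordpiece(tokens, prefix="##"):
    # "##" means: glue this piece onto the previous one, no space.
    out = []
    for t in tokens:
        if t.startswith(prefix):
            out.append(t[len(prefix):])
        else:
            out.append((" " if out else "") + t)
    return "".join(out)

def join_sentencepiece(tokens, marker="\u2581"):
    # "▁" means: this piece starts a new word, so it becomes a space.
    return "".join(tokens).replace(marker, " ").strip()

print(join_wordpiece(["play", "##ing", "games"]))      # playing games
print(join_sentencepiece(["▁play", "ing", "▁games"]))  # playing games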

For now, I do not have a good solution for this issue, so I have not implemented a PR for it.
Maybe we need more research and discussion.

Keep it up! We need cross-platform Chinese/English embeddings ~ the multilingual E5 would be a good fit.