skeskinen/bert.cpp

subword `##` should be an option.

FFengIll opened this issue · 6 comments

For BERT, many models use `##` as the subword symbol, but not all.
Some popular BERT-based models define their own subword symbol.

For example, in e5 the symbol is `▁`.

>>> a = '▁'
>>> a.encode('utf-8')
b'\xe2\x96\x81'

Furthermore, there is no rule that forces the use of `##`.

In the model, the subword symbol is usually called `replacement` or `continuing_subword_prefix`.
It actually shows up in tokenizer.json.
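
For illustration, here is a rough sketch of how that symbol could be read out of tokenizer.json. This is not bert.cpp code; the field names (`continuing_subword_prefix`, `replacement`) are assumptions based on the Hugging Face tokenizers format, and other layouts would need extra handling.

import json

# Sketch: look up the subword marker in a Hugging Face tokenizer.json.
def find_subword_symbol(path="tokenizer.json"):
    with open(path, encoding="utf-8") as f:
        tok = json.load(f)

    model = tok.get("model", {})
    # WordPiece models (classic BERT) store the "##" prefix here.
    if "continuing_subword_prefix" in model:
        return model["continuing_subword_prefix"]

    # SentencePiece-style tokenizers (e.g. e5) mark word starts with a
    # "replacement" character such as "▁" in the Metaspace pre_tokenizer/decoder.
    for key in ("pre_tokenizer", "decoder"):
        section = tok.get(key) or {}
        if isinstance(section, dict) and "replacement" in section:
            return section["replacement"]
    return None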

Hi, I was also wondering about the subword rules with regard to #31.
I remember trying to get the tokens from the tokenizer, like you did in the PR,
but I also remember having some issues with the subwords when I tried to do that.

Does the code in #31 handle subwords?
Do you have an idea on how to handle models like e5?

Also, an unrelated thought I had earlier: it would be nice to convert test_tokenizer.cpp to Python and run the tests against the reference tokenizers.
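
A rough idea of what such a Python test could look like (just a sketch: it assumes transformers is installed, and get_bertcpp_ids is a hypothetical hook that returns the ids produced by bert.cpp for a sentence, e.g. by reading the output of the C++ test binary):

from transformers import AutoTokenizer

# Sketch: compare bert.cpp token ids against the reference HF tokenizer.
def check_against_reference(model_name, sentences, get_bertcpp_ids):
    ref = AutoTokenizer.from_pretrained(model_name)
    for text in sentences:
        expected = ref.encode(text)
        actual = get_bertcpp_ids(text)  # hypothetical bert.cpp hook
        assert actual == expected, f"mismatch on {text!r}: {actual} != {expected}"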

@skeskinen No, #31 only makes the vocab file optional (because it may be missing).

This issue is a separate problem about subwords (I found it because I hit too many unknown tokens when using e5).

Below are some token samples from BERT-based models.

In m3e, the subword prefix is `##`, like in many BERT models.

"##a": 8139,
"03": 8140,
"09": 8141,
"08": 8142,
"28": 8143,
"##2": 8144,

In e5, the subword symbol is `▁`, since they trained a new tokenizer (below is an excerpt from tokenizer.json).

      [
        "▁si",
        -7.355116367340088
      ],
      [
        "▁ja",
        -7.370460510253906
      ],
      [
        "▁za",
        -7.37307596206665
      ],
      [
        "▁v",
        -7.385393142700195
      ],
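
To make the difference concrete: `##` marks a continuation of the previous piece, while `▁` marks the start of a new word, so joining tokens back into text works differently. A minimal sketch (the token lists are made up for illustration):

# Sketch: joining tokens back into text under the two conventions.
def join_wordpiece(tokens, prefix="##"):
    # "##" means: glue this piece onto the previous one, no space.
    out = []
    for t in tokens:
        if t.startswith(prefix):
            out.append(t[len(prefix):])
        else:
            out.append((" " if out else "") + t)
    return "".join(out)

def join_sentencepiece(tokens, marker="\u2581"):
    # "▁" means: this piece starts a new word, so it becomes a space.
    return "".join(tokens).replace(marker, " ").strip()

print(join_wordpiece(["play", "##ing", "games"]))      # playing games
print(join_sentencepiece(["▁play", "ing", "▁games"]))  # playing games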

For now, I do not have a good solution for this issue, so I have not implemented a PR for it.
Maybe we need more research and discussion.

Keep it up! We need cross-platform Chinese/English embeddings ~ the multilingual E5 would be a good fit.