skeskinen/bert.cpp

implement do_handle_chinese_characters in tokenizing

skeskinen opened this issue · 1 comment

So far I haven't tried what happens to Chinese/Japanese characters during tokenization. Some special handling is required since these languages don't use spaces between words.

It should be relatively simple to copy an existing implementation:

  1. Get inspiration from an existing implementation, e.g.: https://github.com/huggingface/tokenizers/blob/ef5f50605ddf9f8caef1598c0e4853862b9707a7/tokenizers/src/normalizers/bert.rs#L98
  2. Implement that in bert.cpp -> bert_normalize_prompt (see the sketch after this list).
  3. Add some test cases with Asian languages to test_tokenizer.cpp and get the expected results from the Python Transformers library tokenizer.
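For reference, a minimal sketch of what step 2 could look like, following the logic of the linked huggingface normalizer (an `is_chinese_char` check plus a pass that inserts spaces around every CJK codepoint). The function names, the standalone UTF-8 decoder, and the assumption of well-formed UTF-8 input are mine, not bert.cpp's; the real change would fold this into bert_normalize_prompt:

```cpp
#include <cstdint>
#include <string>

// CJK codepoint ranges used by the reference BERT tokenizer.
static bool is_chinese_char(uint32_t cp) {
    return (cp >= 0x4E00  && cp <= 0x9FFF)  ||   // CJK Unified Ideographs
           (cp >= 0x3400  && cp <= 0x4DBF)  ||   // Extension A
           (cp >= 0x20000 && cp <= 0x2A6DF) ||   // Extension B
           (cp >= 0x2A700 && cp <= 0x2B73F) ||   // Extension C
           (cp >= 0x2B740 && cp <= 0x2B81F) ||   // Extension D
           (cp >= 0x2B820 && cp <= 0x2CEAF) ||   // Extension E
           (cp >= 0xF900  && cp <= 0xFAFF)  ||   // Compatibility Ideographs
           (cp >= 0x2F800 && cp <= 0x2FA1F);     // Compatibility Supplement
}

// Decode one UTF-8 codepoint starting at byte i; sets len to its byte length.
// Assumes well-formed UTF-8 (the real normalizer should validate or sanitize).
static uint32_t utf8_codepoint(const std::string & s, size_t i, size_t & len) {
    uint8_t c = (uint8_t) s[i];
    uint32_t cp;
    if      (c < 0x80) { len = 1; cp = c;        }
    else if (c < 0xE0) { len = 2; cp = c & 0x1F; }
    else if (c < 0xF0) { len = 3; cp = c & 0x0F; }
    else               { len = 4; cp = c & 0x07; }
    for (size_t k = 1; k < len; k++) {
        cp = (cp << 6) | ((uint8_t) s[i + k] & 0x3F);
    }
    return cp;
}

// Put a space on both sides of every CJK character so the later whitespace
// split turns them into single-character tokens, matching the
// do_handle_chinese_chars step of the huggingface normalizer.
std::string handle_chinese_chars(const std::string & text) {
    std::string out;
    out.reserve(text.size() * 3);
    size_t i = 0;
    while (i < text.size()) {
        size_t len = 0;
        uint32_t cp = utf8_codepoint(text, i, len);
        if (is_chinese_char(cp)) {
            out += ' ';
            out.append(text, i, len);
            out += ' ';
        } else {
            out.append(text, i, len);
        }
        i += len;
    }
    return out;
}
```

The doubled spaces this produces between adjacent CJK characters are harmless, since the subsequent whitespace split collapses any run of spaces anyway.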

Alternatively:
Replace the whole tokenizer with the huggingface Rust implementation? It would probably need to be simplified at least a little, but I would be fine with adding some Rust code here if it doesn't complicate the build too much.

Another implementation of BERT tokenization: https://github.com/zhihu/cuBERT/blob/master/src/cuBERT/tokenization.cpp
Also, it would probably make sense to move the tokenization tests to Python. That way it would be easy to compare against the hf-transformers output.
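Whichever way the tests end up, the data for a CJK test case would look roughly like this. This is only a hypothetical sketch: the struct name and layout are not the actual test_tokenizer.cpp structure, the model name in the comment is just an example and should match whatever model bert.cpp loads, and the expected ids are deliberately left empty so they can be pasted from the reference tokenizer rather than invented here.

```cpp
#include <string>
#include <vector>

// Hypothetical test-case shape (not the actual layout of test_tokenizer.cpp).
// The expected ids should come from the reference tokenizer, e.g. in Python:
//   from transformers import AutoTokenizer
//   tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
//   print(tok.encode("你好，世界"))
struct tokenizer_test_case {
    std::string      prompt;
    std::vector<int> expected_ids;  // paste the hf-transformers output here
};

static const std::vector<tokenizer_test_case> k_cjk_cases = {
    // expected_ids left empty until the reference output has been generated
    { "你好，世界",     {} },
    { "日本語のテスト", {} },
    { "hello 世界",     {} },
};
```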