implement do_handle_chinese_characters in tokenizing
skeskinen opened this issue · 1 comment
I haven't yet tested what happens with Chinese/Japanese characters during tokenization. Some special handling is required, since these languages don't use spaces between words.
It should be relatively simple to adapt an existing implementation:
- Get inspiration from an existing implementation, e.g.: https://github.com/huggingface/tokenizers/blob/ef5f50605ddf9f8caef1598c0e4853862b9707a7/tokenizers/src/normalizers/bert.rs#L98
- Implement that in bert.cpp -> bert_normalize_prompt
- Add some test cases with Asian languages to test_tokenizer.cpp, generating the expected results with the Python Transformers library tokenizer.
Alternatively:
Replace the whole tokenizer with the HuggingFace Rust implementation? It would probably need to be at least somewhat simplified, but I'd be fine with adding some Rust code here if it doesn't complicate the build too much.
Another implementation of BERT tokenization: https://github.com/zhihu/cuBERT/blob/master/src/cuBERT/tokenization.cpp
Also, it would probably make sense to move the tokenization tests to Python; that would make it easy to compare against hf-transformers output.