Vocabulary file handling
`JapaneseWordPieceTokenizer`, which we use to build the vocabulary, recognizes '\n' (or ' ') as a token. `BertSudachipyTokenizer`, however, removes those tokens from its tokenization results, so the vocabulary can contain entries the tokenizer never produces.
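To illustrate the underlying problem, here is a minimal sketch in plain Python (it uses no SudachiTra APIs, and the file name is just an example): '\n' is also the record separator of the txt vocab format, so a '\n' token silently turns into empty lines and shifts every later token id.

```python
# A vocab that contains a newline token, as JapaneseWordPieceTokenizer may produce.
vocab = ["[UNK]", "[CLS]", "[SEP]", "\n", "word"]

# Write it in the line-based txt format: one token per line.
with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab))

# Read it back the same way.
with open("vocab.txt", encoding="utf-8") as f:
    loaded = f.read().split("\n")

# 5 vs. 6: the '\n' token became two empty entries,
# and every token after it now has a different id.
print(len(vocab), len(loaded))
```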
Currently we just ignore those tokens (and the problems this causes; see #54).
- We may need some error handling for vocab file corruption (a load-time validation sketch follows this list).
- It may be better to actually use those tokens. In that case we need to prepare a new vocab file format, since the current txt format cannot represent '\n', and we also need to modify the chiTra tokenizer and reconsider the corpus cleaning processes related to those tokens (a JSON-based sketch follows).
- If we do not use those tokens, we should remove them during vocab building (a filtering sketch follows).
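For the error-handling option, a hedged sketch of validating the vocab file at load time instead of silently ignoring bad entries; the function name and the exact checks are assumptions, not current chiTra behavior:

```python
def load_vocab_checked(path: str) -> dict[str, int]:
    """Load a txt vocab, failing loudly on signs of corruption."""
    vocab: dict[str, int] = {}
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f):
            token = line.rstrip("\n")
            if token == "":
                # An empty line is the footprint of a '\n' token (or truncation).
                raise ValueError(f"{path}:{lineno + 1}: empty vocab entry")
            if token in vocab:
                raise ValueError(f"{path}:{lineno + 1}: duplicate token {token!r}")
            vocab[token] = len(vocab)
    return vocab
```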
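If we decide to keep those tokens, one possible format is JSON, which escapes control characters and so can represent '\n' and ' ' safely. The schema below (a flat list of tokens in id order) is hypothetical, not an agreed chiTra format:

```python
import json

def save_vocab_json(tokens: list[str], path: str) -> None:
    # json.dump escapes '\n' inside strings, so newline tokens survive.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(tokens, f, ensure_ascii=False, indent=0)

def load_vocab_json(path: str) -> dict[str, int]:
    with open(path, encoding="utf-8") as f:
        tokens = json.load(f)
    return {token: i for i, token in enumerate(tokens)}
```

Choosing this route still requires the tokenizer and corpus-cleaning changes mentioned above; the format change alone only makes the file round-trip safe.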
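For the removal option, a sketch of dropping whitespace-only tokens during vocab building so the vocab and `BertSudachipyTokenizer` agree; the filtering rule (`str.isspace`) is an assumption about which tokens should go:

```python
def filter_whitespace_tokens(tokens: list[str]) -> list[str]:
    # Keep only tokens with at least one non-whitespace character;
    # this removes '\n', ' ', '\t', and empty entries in one pass.
    return [t for t in tokens if t and not t.isspace()]

# Example: ['[UNK]', 'word'] — '\n' and ' ' are gone before the file is written.
print(filter_whitespace_tokens(["[UNK]", "\n", " ", "word"]))
```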