Vocabulary file handling
`JapaneseWordPieceTokenizer`, which we use to build the vocabulary, recognizes '\n' (or ' ') as a token. `BertSudachipyTokenizer`, however, removes those tokens from its tokenization results, so the vocabulary can contain entries the tokenizer never produces.
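To illustrate the underlying problem, here is a minimal sketch in plain Python (it uses no SudachiTra APIs, and the file name is just an example): '\n' is also the record separator of the txt vocab format, so a '\n' token silently turns into empty lines and shifts every later token id.

```python
# A vocab that contains a newline token, as JapaneseWordPieceTokenizer may produce.
vocab = ["[UNK]", "[CLS]", "[SEP]", "\n", "word"]

# Write it in the line-based txt format: one token per line.
with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab))

# Read it back the same way.
with open("vocab.txt", encoding="utf-8") as f:
    loaded = f.read().split("\n")

# 5 vs. 6: the '\n' token became two empty entries,
# and every token after it now has a different id.
print(len(vocab), len(loaded))
```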
Currently we just ignore those tokens (and the problems this causes; see #54).
- We may need some error handling for vocab file corruption (a load-time validation sketch follows this list).
- It may be better to actually use those tokens. In that case we need to prepare a new vocab file format, since the current txt format cannot represent '\n', and we also need to modify the chiTra tokenizer and reconsider the corpus cleaning processes related to those tokens (a JSON-based sketch follows).
- If we do not use those tokens, we should remove them during vocab building (a filtering sketch follows).
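For the error-handling option, a hedged sketch of validating the vocab file at load time instead of silently ignoring bad entries; the function name and the exact checks are assumptions, not current chiTra behavior:

```python
def load_vocab_checked(path: str) -> dict[str, int]:
    """Load a txt vocab, failing loudly on signs of corruption."""
    vocab: dict[str, int] = {}
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f):
            token = line.rstrip("\n")
            if token == "":
                # An empty line is the footprint of a '\n' token (or truncation).
                raise ValueError(f"{path}:{lineno + 1}: empty vocab entry")
            if token in vocab:
                raise ValueError(f"{path}:{lineno + 1}: duplicate token {token!r}")
            vocab[token] = len(vocab)
    return vocab
```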
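If we decide to keep those tokens, one possible format is JSON, which escapes control characters and so can represent '\n' and ' ' safely. The schema below (a flat list of tokens in id order) is hypothetical, not an agreed chiTra format:

```python
import json

def save_vocab_json(tokens: list[str], path: str) -> None:
    # json.dump escapes '\n' inside strings, so newline tokens survive.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(tokens, f, ensure_ascii=False, indent=0)

def load_vocab_json(path: str) -> dict[str, int]:
    with open(path, encoding="utf-8") as f:
        tokens = json.load(f)
    return {token: i for i, token in enumerate(tokens)}
```

Choosing this route still requires the tokenizer and corpus-cleaning changes mentioned above; the format change alone only makes the file round-trip safe.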
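For the removal option, a sketch of dropping whitespace-only tokens during vocab building so the vocab and `BertSudachipyTokenizer` agree; the filtering rule (`str.isspace`) is an assumption about which tokens should go:

```python
def filter_whitespace_tokens(tokens: list[str]) -> list[str]:
    # Keep only tokens with at least one non-whitespace character;
    # this removes '\n', ' ', '\t', and empty entries in one pass.
    return [t for t in tokens if t and not t.isspace()]

# Example: ['[UNK]', 'word'] — '\n' and ' ' are gone before the file is written.
print(filter_whitespace_tokens(["[UNK]", "\n", " ", "word"]))
```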