adapter-hub/hgiyt

Character-tokenized vs subword-tokenized in Japanese

tomohideshibata opened this issue · 1 comment

In Section A.1, the authors say that "we select the character-tokenized Japanese BERT model because it achieved considerably higher scores on preliminary NER fine-tuning evaluations", but in my experience, a subword-tokenized model is consistently better than a character-based model.

I was able to reproduce the above NER result. However, the Japanese portion of WikiAnn is annotated at the character level, so before a subword-tokenized model can be used, the dataset has to be converted to word-level annotations, for example:

Character-level (original):

ja:高 B-LOC
ja:島 I-LOC
ja:市 I-LOC
ja:周 O
ja:辺 O

Word-level (converted):

ja:高島 B-LOC
ja:市 I-LOC
ja:周辺 O

(I will perform this conversion and test the subword-tokenized model later.)
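In case it is useful, here is a minimal sketch of the merge I have in mind, assuming a word segmentation of the same sentence (e.g. from MeCab) is already available; the function name is only illustrative, and a word that straddles an entity boundary simply keeps the tag of its first character:

def chars_to_word_tags(char_tokens, char_tags, words):
    """Collapse per-character BIO tags onto pre-segmented words.

    char_tokens: ["高", "島", "市", "周", "辺"]
    char_tags:   ["B-LOC", "I-LOC", "I-LOC", "O", "O"]
    words:       ["高島", "市", "周辺"]   (from any word segmenter)
    """
    word_tags, i = [], 0
    for word in words:
        span = char_tags[i:i + len(word)]
        assert "".join(char_tokens[i:i + len(word)]) == word, "segmentation mismatch"
        # Keep the first character's tag; promote I- to B- at a new entity start.
        tag = span[0]
        if tag.startswith("I-") and (i == 0 or char_tags[i - 1] == "O"):
            tag = "B-" + tag[2:]
        word_tags.append(tag)
        i += len(word)
    return word_tags

print(chars_to_word_tags(
    ["高", "島", "市", "周", "辺"],
    ["B-LOC", "I-LOC", "I-LOC", "O", "O"],
    ["高島", "市", "周辺"],
))
# -> ['B-LOC', 'I-LOC', 'O']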

All the other datasets are word-based. I have tested the character-tokenized model cl-tohoku/bert-base-japanese-char, which is used in the paper, and the subword-tokenized model cl-tohoku/bert-base-japanese (with a single seed, seed = 1). The subword-tokenized model is consistently better than the character-based model:

Model                        SA     UDP (UAS/LAS)   POS
Monolingual (paper)          88.0   94.7 / 93.0     98.1
Character-tokenized (mine)   88.4   94.8 / 93.1     98.1
Subword-tokenized (mine)     91.1   95.0 / 93.4     98.2

It would be great if you could confirm this result.
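For reference, a quick way to compare how the two tokenizers split the same string (this assumes transformers is installed, plus fugashi/ipadic for the MeCab pre-segmentation used by the cl-tohoku models):

from transformers import AutoTokenizer

char_tok = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")
subword_tok = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

text = "高島市周辺"
print(char_tok.tokenize(text))     # roughly one token per character
print(subword_tok.tokenize(text))  # MeCab word segmentation followed by WordPiece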

xplip commented

Hi there, sorry for the very long delay. I finally got around to looking at this. Thanks for taking the time to re-run the experiments and for pointing this out.

I can definitely confirm your results, so the subword-based model seems to be the better choice of the two after all. Unfortunately, I did not account for the conversion from character level to word level in the NER data. That conversion should of course be done before applying the subword-tokenized model, so that the comparison is fair both between the two Japanese models and against mBERT. In hindsight, choosing the character-based model solely on the preliminary NER runs without making that conversion was not ideal. Note that mBERT tokenizes Kanji at the character level (but not Hiragana and Katakana), so the conversion should affect mBERT's performance less than that of the subword-tokenized Japanese model.

For what it's worth, it does not seem to matter which of the two Japanese models we use in the experiments as far as our conclusions are concerned. The subword-based tokenizer has very low fertility and proportion of continued words (it generally appears to be a more effective tokenizer than both the character-based one and mBERT's), and it leads to better overall performance when trained on the same data, which corroborates the conclusion that the tokenizer should be well chosen :).
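(For readers who have not seen the paper: fertility is the average number of subword pieces a tokenizer produces per word, and the proportion of continued words is the share of words that get split into more than one piece. A rough sketch of both statistics, assuming a list of pre-segmented words and a Hugging Face tokenizer; the function name is illustrative:)

def tokenizer_stats(tokenizer, words):
    # Number of subword pieces each word is split into (treat empty output as 1).
    pieces_per_word = [len(tokenizer.tokenize(w)) or 1 for w in words]
    fertility = sum(pieces_per_word) / len(pieces_per_word)
    continued = sum(n > 1 for n in pieces_per_word) / len(pieces_per_word)
    return fertility, continued

# e.g. tokenizer_stats(subword_tok, ["高島", "市", "周辺"])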