ismla-japanese-helper/japanese-helper

add missing very common tokens

verenablaschke opened this issue · 2 comments

Check if the new version of the Wiktionary dump contains the following tokens; otherwise add them:

  • (copula) ✔️
  • する (to do; the current version contains the less often used kanji version): this is in the new Wiktionary dump

Also make sure that their inflections are still generated correctly (cf. #9); if necessary, update the switch statement in WiktionaryPreprocessor.readTokenFile. ✔️

  • add kana version of いる・居る ✔️

We could add (common) punctuation marks too (。、・「」), so we don't have a guaranteed OOV token in every single sentence.

Punctuation marks (as identified by Kuromoji) now get the meaning/translation [punctuation mark] instead of [out-of-vocabulary] (9a6cf0e).
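
For illustration, the fallback order described above could look roughly like the sketch below. This is not the code from 9a6cf0e: the class `GlossLookup`, the method `gloss`, and the plain `Map` dictionary are hypothetical names, and the punctuation check is a hard-coded set of the marks listed in this issue rather than Kuromoji's own classification.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the repository.
final class GlossLookup {
    // Assumption: a fixed set stands in for Kuromoji's punctuation detection.
    private static final Set<String> PUNCTUATION = Set.of("。", "、", "・", "「", "」");

    static final String PUNCTUATION_GLOSS = "[punctuation mark]";
    static final String OOV_GLOSS = "[out-of-vocabulary]";

    private GlossLookup() {}

    // Punctuation is checked first, so it never falls through to the OOV gloss.
    static String gloss(String token, Map<String, String> dictionary) {
        if (PUNCTUATION.contains(token)) {
            return PUNCTUATION_GLOSS;
        }
        // Known tokens get their dictionary entry; everything else is OOV.
        return dictionary.getOrDefault(token, OOV_GLOSS);
    }

    public static void main(String[] args) {
        // Toy dictionary with a single entry taken from this issue.
        Map<String, String> dictionary = Map.of("する", "to do");
        for (String token : new String[] {"する", "。", "xyz"}) {
            System.out.println(token + " -> " + gloss(token, dictionary));
        }
    }
}
```

Checking punctuation before the dictionary lookup means a sentence-final 。 is labelled as punctuation even when the dictionary has no entry for it.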