add missing very common tokens
verenablaschke opened this issue · 2 comments
verenablaschke commented
Check if the new version of the Wiktionary dump contains the following tokens; otherwise add them:
- だ (copula) ✔️
- する (to do; the current version contains the less frequently used kanji version): this is in the new Wiktionary dump
Also make sure that their inflections are still generated correctly (cf. #9); if necessary, update the switch statement in `WiktionaryPreprocessor.readTokenFile`. ✔️
- add kana version of いる・居る ✔️
verenablaschke commented
We could add (common) punctuation marks too (`。、・「」`), so we don't have a guaranteed OOV token in every single sentence.
verenablaschke commented
Punctuation marks (as identified by Kuromoji) now get the meaning/translation `[punctuation mark]` instead of `[out-of-vocabulary]` (9a6cf0e).
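
For anyone reading along, here is a rough sketch of the idea, not the code from the commit. The class `PunctuationGloss`, the method `glossFor`, and the `dictionary` map are hypothetical names made up for illustration. It also assumes that the tokenizer's top-level part-of-speech tag for symbols is `記号`, as in Kuromoji's IPADIC tag set; the tag is passed in as a plain string so the sketch runs without Kuromoji on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: names and structure are illustrative only and are
// not taken from the project's code.
class PunctuationGloss {
    static final String PUNCTUATION = "[punctuation mark]";
    static final String OOV = "[out-of-vocabulary]";

    // Assumption: the tokenizer tags punctuation with the top-level
    // part of speech "記号" (symbol), as in Kuromoji's IPADIC tag set.
    static final String SYMBOL_POS = "記号";

    // Returns the gloss for a token: the punctuation placeholder for symbols,
    // the dictionary entry if one exists, and the OOV placeholder otherwise.
    static String glossFor(String surface, String posLevel1, Map<String, String> dictionary) {
        if (SYMBOL_POS.equals(posLevel1)) {
            return PUNCTUATION;
        }
        String gloss = dictionary.get(surface);
        return gloss != null ? gloss : OOV;
    }

    public static void main(String[] args) {
        // Toy dictionary standing in for the entries read from the Wiktionary dump.
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("だ", "copula");
        dictionary.put("する", "to do");

        System.out.println(glossFor("。", "記号", dictionary));
        System.out.println(glossFor("する", "動詞", dictionary));
        System.out.println(glossFor("ググる", "動詞", dictionary));
    }
}
```

Checking the part-of-speech tag before the dictionary lookup means a sentence-final `。` no longer counts as OOV, while genuinely unknown words still do.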