WorksApplications/SudachiPy

No reading form for certain words

sorami opened this issue · 0 comments

>>> from sudachipy import tokenizer, dictionary
>>> tokenizer_obj = dictionary.Dictionary().create()
>>> [m.reading_form() for m in tokenizer_obj.tokenize("コンピュータ")]
['']
>>> [m.reading_form() for m in tokenizer_obj.tokenize("計算機")]
['ケイサンキ']

`reading_form()` should fall back to the surface form when the lexicon has no reading form for the word, so the first call above should return `['コンピュータ']` rather than `['']`.

For example, the original Java implementation does this in dictionary/WordInfoList.java:

    WordInfo getWordInfo(int wordId) {
        
        ...

        String readingForm = bufferToString(buf);
        if (readingForm.isEmpty()) {
            readingForm = surface;
        }

        ...

    }
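A minimal Python sketch of the same fallback, for illustration only. `reading_or_surface` and `StubMorpheme` are hypothetical names that are not part of SudachiPy; the stub only mimics the `surface()` and `reading_form()` methods so the snippet runs without a dictionary installed. Where the fix belongs in SudachiPy's own lexicon code is not shown here.

```python
def reading_or_surface(reading_form: str, surface: str) -> str:
    """Return the reading form, or the surface when the reading is empty.

    Hypothetical helper mirroring the `readingForm.isEmpty()` check in the
    Java excerpt; it is not an existing SudachiPy function.
    """
    return reading_form if reading_form else surface


class StubMorpheme:
    """Hypothetical stand-in for a morpheme (assumption, not SudachiPy code)."""

    def __init__(self, surface: str, reading_form: str) -> None:
        self._surface = surface
        self._reading_form = reading_form

    def surface(self) -> str:
        return self._surface

    def reading_form(self) -> str:
        return self._reading_form


# The two cases from this report: an empty reading and a populated one.
morphemes = [
    StubMorpheme("コンピュータ", ""),
    StubMorpheme("計算機", "ケイサンキ"),
]
readings = [reading_or_surface(m.reading_form(), m.surface()) for m in morphemes]
print(readings)  # ['コンピュータ', 'ケイサンキ']
```

Until a fix lands, the same expression can be applied on the caller side to the morphemes returned by `tokenize`, e.g. `m.reading_form() or m.surface()`.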

Thanks to sig_m on the Slack channel for reporting this!