NoUnique/pymecab-ko

Broken `node.surface`

NoUnique opened this issue · 2 comments

Reported by ch.bahk

$ pip list

Package      Version
------------ -------
mecab-ko     1.0.0
mecab-ko-dic 1.0.0

import mecab_ko as mecab

SENTENCE = "애플이 영국의 스타트업을 10억 달러에 인수하는 것을 알아보고 있다."


def main():
    tagger = mecab.Tagger()

    node = tagger.parseToNode(SENTENCE)

    while node:
        print(f"{node.surface}\t{node.feature}")

        node = node.next


if __name__ == "__main__":
    main()

Expected output:

        BOS/EOS,*,*,*,*,*,*,*
애플    NNP,*,T,애플,*,*,*,*
이      JKS,*,F,이,*,*,*,*
영국    NNP,지명,T,영국,*,*,*,*
의      JKG,*,F,의,*,*,*,*
스타트업        NNG,*,T,스타트업,Compound,*,*,스타트/NNG/*+업/NNG/*
을      JKO,*,T,을,*,*,*,*
10      SN,*,*,*,*,*,*,*
억      NR,*,T,억,*,*,*,*
달러    NNBC,*,F,달러,*,*,*,*
에      JKB,*,F,에,*,*,*,*
인수    NNG,*,F,인수,*,*,*,*
하      XSV,*,F,하,*,*,*,*
는      ETM,*,T,는,*,*,*,*
것      NNB,*,T,것,*,*,*,*
을      JKO,*,T,을,*,*,*,*
알아보  VV,*,F,알아보,*,*,*,*
고      EC,*,F,고,*,*,*,*
있      VX,*,T,있,*,*,*,*
다      EF,*,F,다,*,*,*,*
.       SF,*,*,*,*,*,*,*
        BOS/EOS,*,*,*,*,*,*,*

Actual results:

  • Usually
        BOS/EOS,*,*,*,*,*,*,*
Traceback (most recent call last):
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce1' in position 1: surrogates not allowed
  • Sometimes (note that node.surface is printed as garbage data: d:.,V)
        BOS/EOS,*,*,*,*,*,*,*
d:.,V   NNP,*,T,애플,*,*,*,*
Traceback (most recent call last):
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcbc' in position 2: surrogates not allowed

This seems like a memory-related issue, but I'm not sure how to debug or fix it.

This is an error in mecab caused by a garbage collection problem (https://shogo82148.github.io/blog/2015/12/20/mecab-in-python3-final/).

It has already been patched in mecab (https://github.com/taku910/mecab/pull/24/files), but the patch has not been applied to mecab-ko yet.
That is my fault: the way I manage the mecab-ko source code overwrote those modifications.
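
For anyone puzzled by the `'\udce1'` in the tracebacks above: a lone surrogate like that is what Python produces when non-UTF-8 bytes are decoded with the `surrogateescape` error handler. The sketch below only illustrates that mechanism with standard Python. That the binding decodes `node.surface` this way is an assumption on my part, and the helper names `decode_like_binding` and `can_print_as_utf8` are hypothetical.

```python
def decode_like_binding(raw: bytes) -> str:
    # Assumption: the binding decodes the surface bytes with the
    # "surrogateescape" handler, so invalid bytes become lone surrogates
    # instead of raising an error at decode time.
    return raw.decode("utf-8", errors="surrogateescape")


def can_print_as_utf8(text: str) -> bool:
    # print() on a UTF-8 terminal has to encode the string; lone
    # surrogates cannot be encoded with the default strict handler.
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# Valid UTF-8 bytes decode to the expected surface form.
good = decode_like_binding("애플".encode("utf-8"))

# Arbitrary bytes standing in for memory that was already freed:
# 0xE1 starts a multi-byte sequence but ":" is not a continuation byte.
bad = decode_like_binding(b"d\xe1:")
```

Here `good` is a normal string, while `bad` contains `'\udce1'` at position 1 and fails as soon as it is encoded for printing, which matches the shape of the error in the report.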

I will apply the changes in the next release of mecab-ko and pymecab-ko. Thank you for the detailed report.
Until then, you can work around the garbage collection problem by adding one line, `tagger.parse("")`, right after creating the tagger:

import mecab_ko as mecab

SENTENCE = "애플이 영국의 스타트업을 10억 달러에 인수하는 것을 알아보고 있다."


def main():
    tagger = mecab.Tagger()
    tagger.parse("")

    node = tagger.parseToNode(SENTENCE)

    while node:
        print(f"{node.surface}\t{node.feature}")

        node = node.next


if __name__ == "__main__":
    main()
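
If you walk nodes in several places, the same workaround can be wrapped in a small helper. This is only a sketch: `iter_tokens` is a hypothetical name, not part of mecab-ko, and it assumes nothing beyond the `parse`, `parseToNode`, `surface`, `feature`, and `next` members used above. It repeats the `tagger.parse("")` call on every invocation for simplicity, and it copies `surface` and `feature` out of each node as soon as the node is visited.

```python
from typing import Any, Iterator, Tuple


def iter_tokens(tagger: Any, sentence: str) -> Iterator[Tuple[str, str]]:
    """Yield (surface, feature) pairs, skipping the BOS/EOS marker nodes."""
    # Workaround from this thread: parse an empty string before parseToNode.
    tagger.parse("")

    node = tagger.parseToNode(sentence)
    while node:
        feature = node.feature
        if not feature.startswith("BOS/EOS"):
            # Read the surface right away and hand back plain Python strings.
            yield node.surface, feature
        node = node.next
```

Usage would look like `for surface, feature in iter_tokens(mecab.Tagger(), SENTENCE): print(f"{surface}\t{feature}")`. Because the generator holds `sentence` as a local for as long as it is being iterated, the input string also stays referenced during the walk.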

This bug is fixed in v1.0.1.