Broken `node.surface`
NoUnique opened this issue · 2 comments
NoUnique commented
Reported by ch.bahk
$ pip list
Package Version
------------ -------
mecab-ko 1.0.0
mecab-ko-dic 1.0.0
import mecab_ko as mecab

SENTENCE = "애플이 영국의 스타트업을 10억 달러에 인수하는 것을 알아보고 있다."

def main():
    tagger = mecab.Tagger()
    node = tagger.parseToNode(SENTENCE)
    while node:
        print(f"{node.surface}\t{node.feature}")
        node = node.next

if __name__ == "__main__":
    main()
Expected output:
BOS/EOS,*,*,*,*,*,*,*
애플 NNP,*,T,애플,*,*,*,*
이 JKS,*,F,이,*,*,*,*
영국 NNP,지명,T,영국,*,*,*,*
의 JKG,*,F,의,*,*,*,*
스타트업 NNG,*,T,스타트업,Compound,*,*,스타트/NNG/*+업/NNG/*
을 JKO,*,T,을,*,*,*,*
10 SN,*,*,*,*,*,*,*
억 NR,*,T,억,*,*,*,*
달러 NNBC,*,F,달러,*,*,*,*
에 JKB,*,F,에,*,*,*,*
인수 NNG,*,F,인수,*,*,*,*
하 XSV,*,F,하,*,*,*,*
는 ETM,*,T,는,*,*,*,*
것 NNB,*,T,것,*,*,*,*
을 JKO,*,T,을,*,*,*,*
알아보 VV,*,F,알아보,*,*,*,*
고 EC,*,F,고,*,*,*,*
있 VX,*,T,있,*,*,*,*
다 EF,*,F,다,*,*,*,*
. SF,*,*,*,*,*,*,*
BOS/EOS,*,*,*,*,*,*,*
Actual results:
- Usually
BOS/EOS,*,*,*,*,*,*,*
Traceback (most recent call last):
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce1' in position 1: surrogates not allowed
- Sometimes (note that `node.surface` is printed as garbage data, `d:.,V`)
BOS/EOS,*,*,*,*,*,*,*
d:.,V NNP,*,T,애플,*,*,*,*
Traceback (most recent call last):
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcbc' in position 2: surrogates not allowed
This seems like a memory-related issue, but I'm not sure how to debug or fix it.
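A note on the error message: the `\udce1` and `\udcbc` code points in the tracebacks above are lone surrogates, which is what Python produces when a raw byte (here 0xE1 or 0xBC) that is not valid UTF-8 is decoded with the `surrogateescape` error handler. That would be consistent with `node.surface` pointing at bytes that are no longer the original sentence. The sketch below reproduces the same exception using only the standard library. The byte string is made up for illustration, and the assumption that the binding decodes with `surrogateescape` is an inference from the message, not something confirmed in this thread.

```python
# Illustration only: GARBAGE stands in for whatever bytes node.surface
# happened to point at; it is not taken from a real mecab-ko run.
GARBAGE = b"\xe1V"  # 0xE1 followed by "V" is not valid UTF-8


def decode_like_binding(raw: bytes) -> str:
    """Decode bytes the way the binding is assumed to (surrogateescape)."""
    return raw.decode("utf-8", errors="surrogateescape")


def can_print(text: str) -> bool:
    """Return True if text survives a strict UTF-8 encode, as a UTF-8 stdout requires."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


surface = decode_like_binding(GARBAGE)
print(ascii(surface))      # the invalid byte shows up as an escaped lone surrogate
print(can_print(surface))  # False: "surrogates not allowed"
print(can_print("애플"))    # True: a real surface string encodes fine
```

If this reading is right, the `UnicodeEncodeError` is only a symptom: the string was already wrong before `print` tried to encode it.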
NoUnique commented
This is a MeCab bug caused by a garbage collection problem (https://shogo82148.github.io/blog/2015/12/20/mecab-in-python3-final/).
It has already been patched in mecab (https://github.com/taku910/mecab/pull/24/files), but the patch has not been applied to mecab-ko yet.
That is the result of my flawed policy for managing the mecab-ko source code, which overwrites these modifications.
I will apply the changes to the next release of mecab-ko and pymecab-ko. Thank you for the detailed report.
Until then, you can avoid the garbage collection problem by adding one line, `tagger.parse("")`, before the call to `parseToNode`:
import mecab_ko as mecab

SENTENCE = "애플이 영국의 스타트업을 10억 달러에 인수하는 것을 알아보고 있다."

def main():
    tagger = mecab.Tagger()
    # Workaround: parse an empty string first to avoid the garbage collection problem.
    tagger.parse("")
    node = tagger.parseToNode(SENTENCE)
    while node:
        print(f"{node.surface}\t{node.feature}")
        node = node.next

if __name__ == "__main__":
    main()
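If adding `tagger.parse("")` at every call site is inconvenient, another option that may sidestep the problem is to avoid `node.surface` altogether and split the text returned by `tagger.parse(SENTENCE)` instead. The sketch below assumes the default output format (one `surface<TAB>feature` line per token, terminated by an `EOS` line). `split_parse_output` is a hypothetical helper name, not part of mecab-ko, and the approach has not been verified against the affected release.

```python
from typing import List, Tuple


def split_parse_output(output: str) -> List[Tuple[str, str]]:
    """Split MeCab's default text output into (surface, feature) pairs.

    Assumes one "surface<TAB>feature" line per token and a final "EOS" line.
    """
    tokens = []
    for line in output.splitlines():
        if line == "EOS" or not line:
            continue  # skip the end-of-sentence marker and blank lines
        surface, _, feature = line.partition("\t")
        tokens.append((surface, feature))
    return tokens


# Sample text in the assumed format, using two tokens from the expected output.
SAMPLE = "애플\tNNP,*,T,애플,*,*,*,*\n이\tJKS,*,F,이,*,*,*,*\nEOS\n"
pairs = split_parse_output(SAMPLE)
print(len(pairs))  # 2
```

With mecab-ko installed, the call would look like `split_parse_output(mecab.Tagger().parse(SENTENCE))`. Unlike the node walk, this does not produce the BOS/EOS entries.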