By-pass the tokenizer
wongtaksum opened this issue · 1 comment
wongtaksum commented
Thank you for creating the tool for public use!
I found that the tokenizer does not work well in some cases. Is there any way to feed delimited (pre-tokenized) input directly to your POS tagger and dependency parser, bypassing your tokenizer?
KoichiYasuoka commented
Bypass the tokenizer... Well, you can do that by using the ufal.udpipe
module directly:
>>> import os,ufal.udpipe,udchinese.udchinese
>>> m=ufal.udpipe.Model.load(os.path.join(udchinese.udchinese.PACKAGE_DIR,"ud-chinese.udpipe"))
>>> udpipe=ufal.udpipe.Pipeline(m,"conllu","","","")
>>> nlp=lambda x:udpipe.process("\n".join("\t".join([str(i+1),j]+["_"]*8) for i,j in enumerate(x.split()))+"\n\n")
>>> doc=nlp("不 入 虎穴 不 得 虎子")
>>> print(doc)
1 不 不 ADV v,副詞,否定,無界 Polarity=Neg 2 advmod _ _
2 入 入 VERB v,動詞,行為,移動 _ 0 root _ _
3 虎穴 虎穴 NOUN n,名詞,固定物,地形 _ 2 obj _ _
4 不 不 ADV v,副詞,否定,無界 Polarity=Neg 5 advmod _ _
5 得 得 VERB v,動詞,行為,得失 _ 2 parataxis _ _
6 虎子 虎子 NOUN n,名詞,人,人 _ 5 obj _ _
But bypassing the tokenizer this way seems to have some undesirable side effects...