KoichiYasuoka/UD-Chinese

By-pass the tokenizer

wongtaksum opened this issue · 1 comment

Thank you for creating the tool for public use!
I found that the tokenizer does not work well on some occasions. Is there any way to give delimited (pre-segmented) input directly to your POS and dependency parser and by-pass your tokenizer?

By-pass the tokenizer... Well, you can do that just by using the ufal.udpipe module directly:

>>> import os,ufal.udpipe,udchinese.udchinese
>>> m=ufal.udpipe.Model.load(os.path.join(udchinese.udchinese.PACKAGE_DIR,"ud-chinese.udpipe"))
>>> udpipe=ufal.udpipe.Pipeline(m,"conllu","","","")
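>>> # nlp below builds a bare CoNLL-U document from whitespace-separated tokens:
>>> # each token becomes one line with only ID and FORM filled in and the other
>>> # eight columns left as "_", so the already-segmented words go straight to
>>> # the tagger and parser and the tokenizer is never invoked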
>>> nlp=lambda x:udpipe.process("\n".join("\t".join([str(i+1),j]+["_"]*8) for i,j in enumerate(x.split()))+"\n\n")
>>> doc=nlp("不 入 虎穴 不 得 虎子")
>>> print(doc)
1	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	2	advmod	_	_
2	入	入	VERB	v,動詞,行為,移動	_	0	root	_	_
3	虎穴	虎穴	NOUN	n,名詞,固定物,地形	_	2	obj	_	_
4	不	不	ADV	v,副詞,否定,無界	Polarity=Neg	5	advmod	_	_
5	得	得	VERB	v,動詞,行為,得失	_	2	parataxis	_	_
6	虎子	虎子	NOUN	n,名詞,,	_	5	obj	_	_

But by-passing the tokenizer this way seems to cause some undesirable side effects...
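For reuse, here is a minimal, self-contained sketch along the same lines, wrapped in a function that takes a list of already-segmented words; the function name parse_pretokenized is just illustrative, and the output format is spelled out as "conllu" rather than left empty as above:

import os
import ufal.udpipe
import udchinese.udchinese

# Load the bundled model once and build a pipeline that reads CoNLL-U input,
# so UDPipe's own tokenizer is never run.
model = ufal.udpipe.Model.load(
    os.path.join(udchinese.udchinese.PACKAGE_DIR, "ud-chinese.udpipe"))
pipeline = ufal.udpipe.Pipeline(model, "conllu", "", "", "conllu")

def parse_pretokenized(tokens):
    # Turn a list of already-segmented words into a one-sentence CoNLL-U
    # document: ID and FORM filled in, the remaining eight columns "_".
    conllu = "\n".join(
        "\t".join([str(i + 1), form] + ["_"] * 8)
        for i, form in enumerate(tokens)) + "\n\n"
    return pipeline.process(conllu)

print(parse_pretokenized(["不", "入", "虎穴", "不", "得", "虎子"]))

UDPipe also offers a "horizontal" input format (one sentence per line, words separated by spaces), which should give an equivalent result without building the CoNLL-U lines by hand.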