UniDic-COMBO

UniDic2UD + COMBO-pytorch wrapper for spaCy

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

Basic Usage

>>> import unidic_combo
>>> nlp=unidic_combo.load("kindai")
>>> doc=nlp("澤山居つた兄弟が一疋も見えぬ")
>>> print(unidic_combo.to_conllu(doc))
# text = 澤山居つた兄弟が一疋も見えぬ
1	澤山	沢山	ADV	副詞	_	2	advmod	_	SpaceAfter=No|Translit=タクサン
2	居つ	居る	VERB	動詞-非自立可能	_	4	acl	_	SpaceAfter=No|Translit=オッ
3	た	た	AUX	助動詞	_	2	aux	_	SpaceAfter=No|Translit=タ
4	兄弟	兄弟	NOUN	名詞-普通名詞-一般	_	9	nsubj	_	SpaceAfter=No|Translit=キョウダイ
5	が	が	ADP	助詞-格助詞	_	4	case	_	SpaceAfter=No|Translit=ガ
6	一	一	NUM	名詞-数詞	_	7	nummod	_	SpaceAfter=No|Translit=イチ
7	疋	匹	NOUN	接尾辞-名詞的-助数詞	_	9	obl	_	SpaceAfter=No|Translit=ピキ
8	も	も	ADP	助詞-係助詞	_	7	case	_	SpaceAfter=No|Translit=モ
9	見え	見える	VERB	動詞-一般	_	0	root	_	SpaceAfter=No|Translit=ミエ
10	ぬ	ず	AUX	助動詞	_	9	aux	_	SpaceAfter=No|Translit=ヌ
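
The CoNLL-U text printed above can be post-processed with plain Python. The following is a minimal sketch, not part of UniDic-COMBO: `parse_conllu` and `CONLLU_FIELDS` are hypothetical helper names, and the code assumes plain integer token IDs (no multiword-token ranges or empty nodes). It splits each row into the ten standard CoNLL-U columns and unpacks the MISC column so that `Translit` can be read directly.

```python
# Column names of the CoNLL-U format, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Parse one CoNLL-U sentence into a list of per-token dicts (hypothetical helper)."""
    tokens = []
    for line in text.splitlines():
        # Skip blank lines and comment lines such as "# text = ...".
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(CONLLU_FIELDS, line.split("\t")))
        # MISC holds "Key=Value" pairs separated by "|"; "_" means empty.
        misc = {}
        if row["misc"] != "_":
            for item in row["misc"].split("|"):
                key, _, value = item.partition("=")
                misc[key] = value
        row["misc"] = misc
        # Assumption: IDs and heads are plain integers in this sketch.
        row["id"] = int(row["id"])
        row["head"] = int(row["head"])
        tokens.append(row)
    return tokens

# Two rows copied from the output above (tab-separated).
sample = (
    "# text = 澤山居つた兄弟が一疋も見えぬ\n"
    "1\t澤山\t沢山\tADV\t副詞\t_\t2\tadvmod\t_\tSpaceAfter=No|Translit=タクサン\n"
    "9\t見え\t見える\tVERB\t動詞-一般\t_\t0\troot\t_\tSpaceAfter=No|Translit=ミエ\n"
)

for tok in parse_conllu(sample):
    print(tok["id"], tok["form"], tok["lemma"], tok["deprel"], tok["misc"].get("Translit"))
```

In a live session the same function could be applied to `unidic_combo.to_conllu(doc)` instead of `sample`.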

>>> import deplacy
>>> deplacy.render(doc,Japanese=True)
澤山 ADV  <══╗     advmod(連用修飾語)
居つ VERB ═╗═╝<╗   acl(連体修飾節)
た   AUX  <╝   ║   aux(動詞補助成分)
兄弟 NOUN ═╗═══╝<╗ nsubj(主語)
が   ADP  <╝     ║ case(格表示)
一   NUM  <╗     ║ nummod(数量による修飾語)
疋   NOUN ═╝═╗<╗ ║ obl(斜格補語)
も   ADP  <══╝ ║ ║ case(格表示)
見え VERB ═╗═══╝═╝ ROOT(親)
ぬ   AUX  <╝       aux(動詞補助成分)

>>> from deplacy.deprelja import deprelja
>>> for b in unidic_combo.bunsetu_spans(doc):
...   for t in b.lefts:
...     print(unidic_combo.bunsetu_span(t),"->",b,"("+deprelja[t.dep_]+")")
...
澤山 -> 居つた (連用修飾語)
居つた -> 兄弟が (連体修飾節)
兄弟が -> 見えぬ (主語)
一疋も -> 見えぬ (斜格補語)

unidic_combo.load(UniDic,BERT=True) loads a spaCy Language pipeline for UniDic2UD + COMBO-pytorch. The UniDic options used in this README are "kindai", "qkana", and "kinsei".

The BERT=True/BERT=False option enables/disables the use of bert-base-japanese-whole-word-masking.
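
As a rough illustration of the two arguments described above, the sketch below wraps `unidic_combo.load` and collects per-token attributes from the resulting spaCy Doc. `load_pipeline`, `token_rows`, and `compare_options` are hypothetical helper names, not part of the package. The sketch assumes that the pipeline fills the standard spaCy token attributes (`lemma_`, `pos_`, `dep_`, `head`), and the import is done lazily so the file can be read without UniDic-COMBO installed.

```python
def load_pipeline(unidic="kindai", use_bert=True):
    """Load a UniDic-COMBO pipeline for the given UniDic option (hypothetical wrapper)."""
    import unidic_combo  # lazy import; requires `pip3 install unidic_combo`
    return unidic_combo.load(unidic, BERT=use_bert)

def token_rows(doc):
    """Collect (text, lemma, UPOS, head index, dependency label) for each token."""
    # Assumption: standard spaCy token attributes are populated by the pipeline.
    return [(t.text, t.lemma_, t.pos_, t.head.i, t.dep_) for t in doc]

def compare_options(text, options=("kindai", "qkana", "kinsei"), use_bert=False):
    """Parse the same text with several UniDic options and return rows per option."""
    return {opt: token_rows(load_pipeline(opt, use_bert)(text)) for opt in options}
```

Calling `compare_options("澤山居つた兄弟が一疋も見えぬ")` would load each model in turn, which may take a while, so nothing is executed at import time here.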

Installation for Linux

pip3 install unidic_combo

Installation for Cygwin64

Make sure the following packages are installed: python37-devel python37-pip python37-cython python37-numpy python37-cffi gcc-g++ mingw64-x86_64-gcc-g++ gcc-fortran git curl make cmake libopenblas liblapack-devel libhdf5-devel libfreetype-devel libuv-devel. Then run:

curl -L https://raw.githubusercontent.com/KoichiYasuoka/UniDic-COMBO/master/cygwin64.sh | sh

Installation for macOS

g++ --version
pip3 install unidic_combo --user
python3 -m spacy download en_core_web_sm --user

If Jsonnet fails to install, try the following before installing UniDic-COMBO:

( echo '#! /bin/sh' ; echo 'exec gcc `echo $* | sed "s/-arch [^ ]*//g"`' ) > /tmp/clang
chmod 755 /tmp/clang
env PATH="/tmp:$PATH" pip3 install jsonnet --user

If fugashi fails to install, try installing MeCab before installing UniDic-COMBO:

cd /tmp
git clone --depth=1 https://github.com/taku910/mecab
cd mecab/mecab
./configure --with-charset=UTF8
make && sudo make install
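
To see which of the two workarounds above is relevant, it may help to check which modules Python can already find. This is a small sketch using only the standard library. `missing_modules` is a hypothetical helper, and the module names are assumptions about how the packages are imported (in particular, the jsonnet package is assumed to be importable as `_jsonnet`).

```python
import importlib.util

# Assumed import names for the packages discussed above.
MODULES = ["unidic_combo", "fugashi", "_jsonnet"]

def missing_modules(names):
    """Return the module names that cannot be found on the current Python path."""
    return [n for n in names if importlib.util.find_spec(n) is None]

print("missing:", missing_modules(MODULES) or "none")
```

If `_jsonnet` is reported missing, the Jsonnet workaround applies; if `fugashi` is missing, the MeCab build is the one to try.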

Benchmarks

Results of the 舞姬/雪國/荒野より benchmarks:

舞姬              LAS    MLAS   BLEX
UniDic="kindai"   84.91  77.78  85.19
UniDic="qkana"    83.02  77.78  85.19
UniDic="kinsei"   75.93  67.86  71.43

雪國              LAS    MLAS   BLEX
UniDic="qkana"    87.50  82.35  78.43
UniDic="kindai"   83.19  78.43  74.51
UniDic="kinsei"   78.57  73.08  69.23

荒野より          LAS    MLAS   BLEX
UniDic="kindai"   78.53  59.46  59.46
UniDic="qkana"    77.49  59.46  59.46
UniDic="kinsei"   76.04  59.46  59.46
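
For readers unfamiliar with the LAS column, the sketch below shows a simplified version of the idea: the share of tokens whose head and dependency label both match the gold annotation. It assumes gold and predicted tokenization are identical, which the official CoNLL 2018 evaluation does not require (it reports an F1 score over aligned words), so this is not the script that produced the numbers above. The `pred` values are made up for illustration, and MLAS and BLEX are not covered.

```python
def las(gold, pred):
    """Percentage of tokens whose (head, deprel) pair matches the gold pair.

    Simplified sketch: assumes both analyses have the same tokenization.
    """
    if len(gold) != len(pred):
        raise ValueError("this sketch assumes identical tokenization")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return 100.0 * correct / len(gold)

# (head, deprel) per token, copied from the Basic Usage output and
# treated as gold here for illustration only.
gold = [(2, "advmod"), (4, "acl"), (2, "aux"), (9, "nsubj"), (4, "case"),
        (7, "nummod"), (9, "obl"), (7, "case"), (0, "root"), (9, "aux")]

# Hypothetical prediction with one wrong label on token 7.
pred = list(gold)
pred[6] = (9, "nmod")

print(f"LAS = {las(gold, pred):.2f}")
```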

Reference