spaCy-ChaPAS

ChaPAS-CaboCha-MeCab wrapper for spaCy

Basic Usage

>>> import spacy_chapas
>>> nlp=spacy_chapas.load()
>>> doc=nlp("太郎は花子が読んでいる本を次郎に渡した")
>>> for t in doc:
...   print(t.i,t.orth_,t.lemma_,t.pos_,t.tag_,t.head.i,t.dep_,t.norm_,t.ent_iob_,t.ent_type_)
...
0 太郎 太郎 PROPN 名詞-固有名詞-人名-名 12 nsubj タロウ B PERSON
1 は は ADP 助詞-係助詞 0 case ハ O
2 花子 花子 PROPN 名詞-固有名詞-人名-名 4 nsubj ハナコ B PERSON
3 が が ADP 助詞-格助詞-一般 2 case ガ O
4 読ん 読む VERB 動詞-自立 7 acl ヨン O
5 で で CCONJ 助詞-接続助詞 4 mark デ O
6 いる いる AUX 動詞-非自立 4 aux イル O
7 本 本 NOUN 名詞-一般 12 obj ホン O
8 を を ADP 助詞-格助詞-一般 7 case ヲ O
9 次 次 NOUN 名詞-一般 10 compound ツギ O
10 郎 郎 NOUN 名詞-一般 12 obl ロウ O
11 に に ADP 助詞-格助詞-一般 10 case ニ O
12 渡し 渡す VERB 動詞-自立 12 ROOT ワタシ O
13 た た AUX 助動詞 12 aux タ O
>>> import deplacy
>>> deplacy.render(doc,Japanese=True)
太郎 PROPN ═╗<══════════╗ nsubj(主語)
は   ADP   <╝           ║ case(格表示)
花子 PROPN ═╗<╗         ║ nsubj(主語)
が   ADP   <╝ ║         ║ case(格表示)
読ん VERB  ═══╝═╗═╗<╗   ║ acl(連体修飾節)
で   CCONJ <════╝ ║ ║   ║ mark(標識)
いる AUX   <══════╝ ║   ║ aux(動詞補助成分)
本   NOUN  ═╗═══════╝<╗ ║ obj(目的語)
を   ADP   <╝         ║ ║ case(格表示)
次   NOUN  <╗         ║ ║ compound(複合)
郎   NOUN  ═╝═╗<╗     ║ ║ obl(斜格補語)
に   ADP   <══╝ ║     ║ ║ case(格表示)
渡し VERB  ═╗═══╝═════╝═╝ ROOT(親)
た   AUX   <╝             aux(動詞補助成分)
>>> from deplacy.deprelja import deprelja
>>> for b in spacy_chapas.bunsetu_spans(doc):
...   for t in b.lefts:
...     print(spacy_chapas.bunsetu_span(t),"->",b,"("+deprelja[t.dep_]+")")
...
花子が -> 読んでいる (主語)
読んでいる -> 本を (連体修飾節)
太郎は -> 渡した (主語)
本を -> 渡した (目的語)
次郎に -> 渡した (斜格補語)
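
The token attributes printed in the first loop line up with the columns of the CoNLL-U format, so a parsed doc can be written out with a few lines of Python. The following is a minimal sketch and not part of spaCy-ChaPAS: the helper name to_conllu is hypothetical, it reads only the standard spaCy token attributes used above, and it assumes a single-sentence doc whose root token points at itself (as token 12 does in the output above).

```python
def to_conllu(doc):
    """Serialize an iterable of spaCy-like tokens as CoNLL-U lines.

    Hypothetical helper for illustration; it only reads the attributes
    i, orth_, lemma_, pos_, tag_, head.i and dep_.
    """
    lines = []
    for t in doc:
        # spaCy marks the root by making the token its own head;
        # CoNLL-U uses 1-based IDs with HEAD=0 for the root.
        is_root = t.head.i == t.i
        head = 0 if is_root else t.head.i + 1
        deprel = "root" if is_root else t.dep_
        cols = [str(t.i + 1), t.orth_, t.lemma_ or "_", t.pos_ or "_",
                t.tag_ or "_", "_", str(head), deprel, "_", "_"]
        lines.append("\t".join(cols))
    # A blank line terminates the sentence in CoNLL-U.
    return "\n".join(lines) + "\n\n"
```

With the pipeline loaded as above, print(to_conllu(doc)) would emit one tab-separated line per token.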

spacy_chapas.load(UniDic) loads a spaCy Language pipeline for ChaPAS-CaboCha-MeCab. Available UniDic options include "kindai", "qkana", and "kinsei", the three values evaluated in the Benchmarks section.

You can simply use chapas2ud on the command line to get Universal Dependencies:

echo 太郎は花子が読んでいる本を次郎に渡した | chapas2ud -I RAW
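
The same command can be driven from Python with subprocess. The sketch below is illustrative only: the helper names run_chapas2ud and parse_conllu are hypothetical, and it assumes that chapas2ud is on PATH, reads raw text on standard input with -I RAW (as in the command above), and writes 10-column CoNLL-U to standard output.

```python
import shutil
import subprocess


def run_chapas2ud(text):
    """Pipe raw text through `chapas2ud -I RAW` and return its stdout.

    Assumes the chapas2ud command is installed and on PATH.
    """
    if shutil.which("chapas2ud") is None:
        raise RuntimeError("chapas2ud not found on PATH")
    result = subprocess.run(["chapas2ud", "-I", "RAW"], input=text + "\n",
                            capture_output=True, encoding="utf-8", check=True)
    return result.stdout


def parse_conllu(conllu):
    """Return one dict per token line, assuming 10-column CoNLL-U."""
    tokens = []
    for line in conllu.splitlines():
        # Skip blank sentence separators and "#" comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # ignore ID ranges or unexpected lines
        tokens.append({
            "id": int(cols[0]), "form": cols[1], "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    return tokens
```

Calling parse_conllu(run_chapas2ud("太郎は花子が読んでいる本を次郎に渡した")) would then give a list of token dictionaries, provided the output format matches the assumption above.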

Installation for Linux (Debian)

First, install MeCab and the necessary packages (including openjdk-8-jre-headless from oldstable), then build CRF++:

sudo apt update
sudo apt install mecab libmecab-dev mecab-ipadic-utf8 python3-pip python3-dev g++ make curl openjdk-8-jre-headless
pip3 install gdown --user
cd /tmp
curl -L 'https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7QVR6VXJ5dWExSTQ' | tar xzf -
cd CRF++-0.58
./configure --prefix=/usr --libdir=`mecab-config --libs-only-L`
make && sudo make install

Second, install CaboCha:

cd /tmp
gdown 'https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7SDd1Q1dUQkZQaUU'
tar xjf cabocha-0.69.tar.bz2
cd cabocha-0.69
./configure --prefix=/usr --libdir=`mecab-config --libs-only-L` --with-charset=UTF8
make && sudo make install

Third, install ChaPAS:

cd /tmp
gdown 'https://drive.google.com/uc?export=download&id=0BwG_CvJHq43fNDlqSkVSREkzaEk'
tar xzf chapas-0.742.tar.gz
sudo mkdir -p /usr/local/bin
sudo mv chapas-0.742 /usr/local/chapas
( echo '#! /bin/sh' ; echo exec `ls -1 /usr/lib/jvm/j*-1.8.*/bin/java | tail -1` -Xmx1g -jar /usr/local/chapas/chapas.jar '"$@"' ) > chapas
sudo install chapas /usr/local/bin

And last, install spaCy-ChaPAS:

pip3 install spacy_chapas --user
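
After these steps, a quick check that every command-line tool is visible on PATH can save debugging time. This is a minimal sketch rather than part of the package: the function name check_tools is hypothetical, and the command names mecab, cabocha, chapas, and chapas2ud are assumed from the installation steps and usage above.

```python
import shutil


def check_tools(names=("mecab", "cabocha", "chapas", "chapas2ud")):
    """Map each command name to its resolved path, or None if missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in check_tools().items():
        print(f"{name:10s} {path or 'NOT FOUND'}")
```

If chapas2ud is reported as missing after a --user install, the likely cause is that ~/.local/bin, where pip places user-level scripts on Linux, is not on PATH.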

Installation for Linux (Ubuntu)

Same as Debian.

Installation for Linux (Kali)

Same as Debian.

Installation for Linux (CentOS)

First, install MeCab and the necessary packages, then build CRF++:

sudo yum update
sudo yum install python3-pip python3-devel gcc-c++ make curl bzip2 java-1.8.0-openjdk-headless epel-release
sudo rpm -ivh https://packages.groonga.org/centos/latest/groonga-release-latest.noarch.rpm
sudo yum install mecab mecab-devel mecab-ipadic
pip3 install gdown --user
cd /tmp
curl -L 'https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7QVR6VXJ5dWExSTQ' | tar xzf -
cd CRF++-0.58
./configure --prefix=/usr --libdir=`mecab-config --libs-only-L`
make && sudo make install

The second, third, and last steps are the same as for Debian.

Installation for Cygwin64

Make sure the packages python37-devel, python37-pip, python37-cython, python37-numpy, git, and gcc-g++ are installed, and then run:

pip3.7 install git+https://github.com/KoichiYasuoka/chapas-cygwin64
pip3.7 install spacy_chapas

Installation for Google Colaboratory

Try the notebook.

Benchmarks

Results of 舞姬/雪國/荒野より-Benchmarks

舞姬               LAS    MLAS   BLEX
UniDic="kindai"    79.25  59.26  62.96
UniDic="qkana"     77.36  59.26  62.96
UniDic="kinsei"    70.37  53.57  53.57

雪國               LAS    MLAS   BLEX
UniDic="qkana"     87.50  81.63  77.55
UniDic="kinsei"    85.71  77.55  69.39
UniDic="kindai"    83.19  77.55  73.47

荒野より           LAS    MLAS   BLEX
UniDic="kindai"    68.06  35.14  45.95
UniDic="qkana"     64.92  35.14  45.95
UniDic="kinsei"    64.58  32.43  43.24
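
For readers unfamiliar with the first column: LAS (labeled attachment score) counts a token as correct only when both its head and its dependency label match the gold annotation. The sketch below shows that idea under the simplifying assumption that the system and gold tokenizations are identical. It is not the evaluation script used to produce the figures above, and the function name las is hypothetical. MLAS and BLEX apply further criteria and are not covered here.

```python
def las(gold, pred):
    """Labeled attachment score in percent.

    gold and pred are equal-length lists of (head, deprel) pairs,
    one per token, assuming identical tokenization.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")
    if not gold:
        return 0.0
    # A token is correct only if head and label both match.
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return 100.0 * correct / len(gold)


# Made-up four-token example: the third token has the right head
# but the wrong label, so it does not count.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "case")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "case")]
print(f"{las(gold, pred):.2f}")  # 3 of 4 tokens match: prints 75.00
```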
