
JUMAN++ (a Japanese Morphological Analyzer)


JUMAN++


A new morphological analyser that considers semantic plausibility of word sequences by using a recurrent neural network language model (RNNLM).

Notice

This branch contains the Juman++ rewrite (v2). The original version lives in the master branch. The requirements and build steps for the development version follow.

Requirements for v2

  • OS: Linux or MacOS X
  • Compiler: C++14 compatible (will downgrade to C++11 later)
  • CMake (v3.1 or later, will probably downgrade)
  • Boost C++ libraries (needed only for old Juman++)
$ ./autogen.sh # Legacy Juman++ is still in the source tree. 
$ ./configure  # Need to create config.h for it.

# The two commands above are needed only for the first build

$ mkdir build-dir # CMake does not support in-source builds
$ cd build-dir
$ cmake .. # add -DBOOST_ROOT=... if you have Boost installed
$ make

Installation (legacy)

Requirement

Build from git

% git clone git@github.com:ku-nlp/jumanpp.git
% wget http://lotus.kuee.kyoto-u.ac.jp/nl-resource/jumanpp/jumanpp-1.01.tar.xz
  (This repository does not include the resource files, so copy them
   from the distributed package.)
% tar xJvf jumanpp-1.01.tar.xz
% mv jumanpp-1.01/jumanpp-resource jumanpp/.
% cd jumanpp
% ./autogen.sh
% ./configure 
% make
% sudo make install

Build from package

% wget http://lotus.kuee.kyoto-u.ac.jp/nl-resource/jumanpp/jumanpp-1.01.tar.xz
% tar xJvf jumanpp-1.01.tar.xz
% cd jumanpp-1.01
% ./configure 
% make
% sudo make install

Quick start

% echo "魅力がたっぷりと詰まっている" | jumanpp
魅力 みりょく 魅力 名詞 6 普通名詞 1 * 0 * 0 "代表表記:魅力/みりょく カテゴリ:抽象物"
が が が 助詞 9 格助詞 1 * 0 * 0 NIL
たっぷり たっぷり たっぷり 副詞 8 * 0 * 0 * 0 "自動認識"
と と と 助詞 9 格助詞 1 * 0 * 0 NIL
詰まって つまって 詰まる 動詞 2 * 0 子音動詞ラ行 10 タ系連用テ形 14 "代表表記:詰まる/つまる ドメイン:料理・食事 自他動詞:他:詰める/つめる"
いる いる いる 接尾辞 14 動詞性接尾辞 7 母音動詞 1 基本形 2 "代表表記:いる/いる"
EOS
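
The output above can be post-processed with a few lines of Python. The following is a minimal sketch, not part of Juman++: it assumes that every morpheme line consists of eleven space-separated fields followed by a quoted annotation (or NIL), and that each sentence is terminated by an EOS line, as in the example. The field names (surface, reading, base_form, and so on) are labels chosen here from reading the sample output; they are not identifiers defined by Juman++.

```python
from typing import List, NamedTuple, Optional


class Morpheme(NamedTuple):
    # Field names are illustrative labels, not names defined by Juman++.
    surface: str
    reading: str
    base_form: str
    pos: str
    pos_id: int
    sub_pos: str
    sub_pos_id: int
    conj_type: str
    conj_type_id: int
    conj_form: str
    conj_form_id: int
    info: Optional[str]  # None when the output column is NIL


def parse_line(line: str) -> Morpheme:
    # Split on the first 11 spaces only: the trailing annotation
    # may itself contain spaces.
    f = line.rstrip("\n").split(" ", 11)
    if len(f) != 12:
        raise ValueError("unexpected number of fields: %r" % line)
    info = None if f[11] == "NIL" else f[11].strip('"')
    return Morpheme(f[0], f[1], f[2], f[3], int(f[4]), f[5], int(f[6]),
                    f[7], int(f[8]), f[9], int(f[10]), info)


def parse_output(text: str) -> List[List[Morpheme]]:
    # Group morpheme lines into sentences, using EOS as the delimiter.
    sentences: List[List[Morpheme]] = []
    current: List[Morpheme] = []
    for line in text.splitlines():
        if line == "EOS":
            sentences.append(current)
            current = []
        elif line:
            current.append(parse_line(line))
    return sentences


SAMPLE = (
    '魅力 みりょく 魅力 名詞 6 普通名詞 1 * 0 * 0 "代表表記:魅力/みりょく カテゴリ:抽象物"\n'
    'が が が 助詞 9 格助詞 1 * 0 * 0 NIL\n'
    'EOS\n'
)

if __name__ == "__main__":
    for morpheme in parse_output(SAMPLE)[0]:
        print(morpheme.surface, morpheme.pos, morpheme.info)
```

Limiting the split to the first eleven spaces keeps the quoted annotation intact even when it contains several space-separated entries.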

Option

usage: jumanpp [options] 
options:
  -s, --specifics              lattice format output (unsigned int [=5])
  -B, --beam                   set beam width used in analysis (unsigned int [=5])
  -v, --version                print version
  -h, --help                   print this message
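
When jumanpp is called from a script, the options above can be assembled into an argument list. The sketch below is illustrative only: build_command is a hypothetical helper, and it assumes that -s and -B each accept their value as a separate following argument, which the usage text does not state explicitly.

```python
from typing import List, Optional


def build_command(beam: Optional[int] = None,
                  lattice: Optional[int] = None,
                  executable: str = "jumanpp") -> List[str]:
    # Hypothetical helper: builds an argv list from the documented options.
    # Assumption: each value is passed as a separate argument after its flag.
    cmd = [executable]
    for flag, value in (("-s", lattice), ("-B", beam)):
        if value is None:
            continue  # leave the option out so jumanpp uses its default
        if value < 0:
            # The usage text declares both values as unsigned int.
            raise ValueError("%s expects a non-negative integer" % flag)
        cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    print(build_command(beam=10))
```

The resulting list can be passed to subprocess.run together with UTF-8 encoded input on standard input.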

Input

Input must be UTF-8 encoded text. Lines beginning with # are treated as comments.
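
A small pre-processing step can make the two input rules explicit before text is sent to jumanpp. The following sketch uses hypothetical helper names (split_comments, to_stdin_bytes); it only separates the lines that would be read as comments from the lines to be analysed and encodes the latter as UTF-8. How jumanpp itself treats comment lines in its output is not covered here.

```python
from typing import List, Tuple


def split_comments(text: str) -> Tuple[List[str], List[str]]:
    # Lines starting with '#' are comments; everything else is input text.
    comments: List[str] = []
    sentences: List[str] = []
    for line in text.splitlines():
        if line.startswith("#"):
            comments.append(line)
        else:
            sentences.append(line)
    return comments, sentences


def to_stdin_bytes(lines: List[str]) -> bytes:
    # One sentence per line, encoded as UTF-8 as the analyser expects.
    return ("\n".join(lines) + "\n").encode("utf-8")


if __name__ == "__main__":
    comments, sentences = split_comments("# id:1\n魅力がたっぷりと詰まっている\n")
    print(comments)
    print(to_stdin_bytes(sentences))
```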

DEMO

DEMO cgi

Model

See "Morphological Analysis for Unsegmented Languages using Recurrent Neural Network Language Model", Hajime Morita, Daisuke Kawahara, Sadao Kurohashi, EMNLP 2015.

Authors

Hajime Morita hmorita@i.kyoto-u.ac.jp
Daisuke Kawahara dk@i.kyoto-u.ac.jp
Sadao Kurohashi kuro@i.kyoto-u.ac.jp

Acknowledgement

Juman++ uses the following open source software/codes:

Note

Install Requirements

libunwind

wget http://download.savannah.gnu.org/releases/libunwind/libunwind-0.99-beta.tar.gz
tar xzf libunwind-0.99-beta.tar.gz
cd libunwind-0.99-beta
./configure --prefix=/somewhere/local/
make 
make install

gperftools

./configure --prefix=/somewhere/local/
make
# If ld fails while trying to link libunwind.so (and the build fails),
# pass the option "UNWIND_LIBS=-lunwind-x86_64" to make.
make install

Boost Libraries

sh bootstrap.sh
./b2 install -j2 --prefix=/somewhere/local/