SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.
$ pip install sudachipy sudachidict_core
$ echo "高輪ゲートウェイ駅" | sudachipy
高輪ゲートウェイ駅 名詞,固有名詞,一般,*,*,* 高輪ゲートウェイ駅
EOS
$ echo "高輪ゲートウェイ駅" | sudachipy -m A
高輪 名詞,固有名詞,地名,一般,*,* 高輪
ゲートウェイ 名詞,普通名詞,一般,*,*,* ゲートウェー
駅 名詞,普通名詞,一般,*,*,* 駅
EOS
$ echo "空缶空罐空きカン" | sudachipy -a
空缶 名詞,普通名詞,一般,*,*,* 空き缶 空缶 アキカン 0
空罐 名詞,普通名詞,一般,*,*,* 空き缶 空罐 アキカン 0
空きカン 名詞,普通名詞,一般,*,*,* 空き缶 空きカン アキカン 0
EOS
You need SudachiPy and a dictionary.
$ pip install sudachipy
You can get a dictionary as a Python package. It may take a while to download the dictionary file (around 70MB for the core edition).
$ pip install sudachidict_core
Alternatively, you can choose another dictionary edition. See the description of the dictionary editions below for details.
There is a CLI command, sudachipy.
$ echo "外国人参政権" | sudachipy
外国人参政権 名詞,普通名詞,一般,*,*,* 外国人参政権
EOS
$ echo "外国人参政権" | sudachipy -m A
外国 名詞,普通名詞,一般,*,*,* 外国
人 接尾辞,名詞的,一般,*,*,* 人
参政 名詞,普通名詞,一般,*,*,* 参政
権 接尾辞,名詞的,一般,*,*,* 権
EOS
$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-a] [-d] [-v]
[file [file ...]]
Tokenize Text
positional arguments:
file text written in utf-8
optional arguments:
-h, --help show this help message and exit
-r file the setting file in JSON format
-m {A,B,C} the mode of splitting
-o file the output file
-a print all of the fields
-d print the debug information
-v, --version print sudachipy version
Columns are tab separated.
- Surface
- Part-of-Speech Tags (comma separated)
- Normalized Form
When you add the -a option, it additionally outputs
- Dictionary Form
- Reading Form
- Dictionary ID
  - 0 for the system dictionary
  - 1 and above for the user dictionaries
  - -1\t(OOV) if a word is Out-of-Vocabulary (not in the dictionary)
$ echo "外国人参政権" | sudachipy -a
外国人参政権 名詞,普通名詞,一般,*,*,* 外国人参政権 外国人参政権 ガイコクジンサンセイケン 0
EOS
$ echo "阿quei" | sudachipy -a
阿 名詞,普通名詞,一般,*,*,* 阿 阿 -1 (OOV)
quei 名詞,普通名詞,一般,*,*,* quei quei -1 (OOV)
EOS
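The tab-separated output of the -a option can be post-processed with a few lines of Python. The following is a minimal sketch, not part of SudachiPy: the names TokenLine and parse_all_fields are hypothetical, and it assumes that every token line has the six columns listed above, that OOV lines carry a trailing (OOV) column, and that each sentence ends with an EOS line.

```python
from typing import List, NamedTuple


class TokenLine(NamedTuple):
    """One token line of `sudachipy -a` output (hypothetical helper type)."""
    surface: str
    pos: List[str]
    normalized_form: str
    dictionary_form: str
    reading_form: str
    dictionary_id: int
    is_oov: bool


def parse_all_fields(text: str) -> List[List[TokenLine]]:
    """Split `-a` output into sentences, each a list of TokenLine."""
    sentences: List[List[TokenLine]] = []
    current: List[TokenLine] = []
    for line in text.splitlines():
        if not line:
            continue  # ignore blank lines
        if line == "EOS":
            # EOS closes the current sentence
            sentences.append(current)
            current = []
            continue
        cols = line.split("\t")
        if len(cols) < 6:
            raise ValueError(f"expected at least 6 columns: {line!r}")
        current.append(TokenLine(
            surface=cols[0],
            pos=cols[1].split(","),
            normalized_form=cols[2],
            dictionary_form=cols[3],
            reading_form=cols[4],
            dictionary_id=int(cols[5]),
            # Assumption: OOV words carry "(OOV)" as a seventh column
            is_oov=len(cols) > 6 and cols[6] == "(OOV)",
        ))
    return sentences


# Sample input written with explicit tabs; the OOV line is assumed to have
# an empty reading form column.
sample = (
    "外国人参政権\t名詞,普通名詞,一般,*,*,*\t外国人参政権\t外国人参政権"
    "\tガイコクジンサンセイケン\t0\n"
    "EOS\n"
    "阿\t名詞,普通名詞,一般,*,*,*\t阿\t阿\t\t-1\t(OOV)\n"
    "EOS\n"
)
parsed = parse_all_fields(sample)
```

Filtering on is_oov, or on dictionary_id being 1 or above, is then a simple way to see which tokens came from outside the system dictionary.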
Here is an example of using SudachiPy from Python:
from sudachipy import tokenizer
from sudachipy import dictionary
tokenizer_obj = dictionary.Dictionary().create()
# Multi-granular Tokenization
mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家公務員']
mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務員']
mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務', '員']
# Morpheme information
m = tokenizer_obj.tokenize("食べ", mode)[0]
m.surface() # => '食べ'
m.dictionary_form() # => '食べる'
m.reading_form() # => 'タベ'
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']
# Normalization
tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'
(With the 20200330 core dictionary. The results may change when you use other versions.)
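The morpheme accessors shown above are enough to rebuild the tab-separated columns that the CLI prints. The sketch below is an illustration under stated assumptions, not SudachiPy's own formatting code: format_morpheme and FakeMorpheme are hypothetical names, and the stub stands in for a real morpheme so that the snippet runs without a dictionary installed. Its normalized form is a placeholder value, not the result of a real lookup.

```python
def format_morpheme(m, print_all=False):
    """Join morpheme fields with tabs, in the column order the CLI documents.

    `m` is any object exposing surface(), part_of_speech(),
    normalized_form(), dictionary_form() and reading_form().
    """
    fields = [m.surface(), ",".join(m.part_of_speech()), m.normalized_form()]
    if print_all:
        fields += [m.dictionary_form(), m.reading_form()]
    return "\t".join(fields)


class FakeMorpheme:
    """Hypothetical stand-in for a morpheme, used only for this demo."""

    def surface(self):
        return "食べ"

    def part_of_speech(self):
        return ["動詞", "一般", "*", "*", "下一段-バ行", "連用形-一般"]

    def normalized_form(self):
        return "食べる"  # placeholder value, not from a real dictionary lookup

    def dictionary_form(self):
        return "食べる"

    def reading_form(self):
        return "タベ"


line = format_morpheme(FakeMorpheme())
line_all = format_morpheme(FakeMorpheme(), print_all=True)
```

With a real tokenizer, each element returned by tokenizer_obj.tokenize(text, mode) could be passed to format_morpheme in place of the stub.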
There are three editions of Sudachi Dictionary, namely, small, core, and full. See WorksApplications/SudachiDict for details.
SudachiPy uses sudachidict_core by default. You can specify the dictionary with the link -t command.
$ pip install sudachidict_small
$ sudachipy link -t small
$ pip install sudachidict_full
$ sudachipy link -t full
You can remove the dictionary link with the link -u command.
$ sudachipy link -u
Dictionaries are installed as the Python packages sudachidict_small, sudachidict_core, and sudachidict_full. SudachiPy tries to refer to the sudachidict package to use a dictionary. The link subcommand creates a symbolic link to sudachidict_* named sudachidict, to switch between the packages.
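When it is unclear which dictionary packages are available in the current environment, a small diagnostic can help. The following sketch only checks which of the package names mentioned above are importable; it is not SudachiPy's own lookup logic, and the function name importable_packages is hypothetical.

```python
import importlib.util

# Package names taken from the text above; "sudachidict" is the link target.
DICTIONARY_PACKAGES = (
    "sudachidict",
    "sudachidict_small",
    "sudachidict_core",
    "sudachidict_full",
)


def importable_packages(names=DICTIONARY_PACKAGES):
    """Return the names from `names` that Python can locate as modules."""
    return [n for n in names if importlib.util.find_spec(n) is not None]


available = importable_packages()
print("importable dictionary packages:", available)
```

If sudachidict is missing from the result while one of the sudachidict_* packages is present, no link has been created yet.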
The dictionary files are not in the package itself; they are downloaded upon installation.
Alternatively, if the dictionary file is specified in the setting file, sudachi.json, SudachiPy will use that file.
{
"systemDict" : "relative/path/to/system.dic",
...
}
The default setting file is sudachipy/resources/sudachi.json. You can specify your sudachi.json with the -r option.
$ sudachipy -r path/to/sudachi.json
To use a user dictionary, user.dic, place sudachi.json anywhere you like, and add a userDict value with the relative path from sudachi.json to your user.dic.
{
"userDict" : ["relative/path/to/user.dic"],
...
}
Then specify your sudachi.json with the -r option.
$ sudachipy -r path/to/sudachi.json
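Because userDict paths are relative to sudachi.json, it is easy to get them wrong when the two files live in different directories. The sketch below shows one way to register a user dictionary in an existing setting file; add_user_dict is a hypothetical helper, not a SudachiPy function, and it assumes the setting file is plain JSON that already contains the other required keys. The demo at the end runs against a throwaway file in a temporary directory.

```python
import json
import os
import tempfile


def add_user_dict(settings_path, user_dic_path):
    """Append user_dic_path to "userDict", relative to the setting file."""
    settings_dir = os.path.dirname(os.path.abspath(settings_path))
    rel = os.path.relpath(os.path.abspath(user_dic_path), settings_dir)
    rel = rel.replace(os.sep, "/")  # keep forward slashes in the JSON

    with open(settings_path, encoding="utf-8") as f:
        settings = json.load(f)

    # Keep all existing keys; only extend the userDict list.
    user_dicts = settings.setdefault("userDict", [])
    if rel not in user_dicts:
        user_dicts.append(rel)

    with open(settings_path, "w", encoding="utf-8") as f:
        json.dump(settings, f, ensure_ascii=False, indent=4)
    return rel


# Demo on a temporary setting file with a single placeholder key.
with tempfile.TemporaryDirectory() as tmp:
    settings_file = os.path.join(tmp, "sudachi.json")
    with open(settings_file, "w", encoding="utf-8") as f:
        json.dump({"systemDict": "system.dic"}, f)
    demo_rel = add_user_dict(settings_file, os.path.join(tmp, "dict", "user.dic"))
    with open(settings_file, encoding="utf-8") as f:
        demo_settings = json.load(f)
```

Starting from a copy of the default sudachi.json rather than an empty file keeps the remaining settings intact.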
You can build a user dictionary with the subcommand ubuild.
WARNING: ubuild in v0.3.* contains a bug.
$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]
Build User Dictionary
positional arguments:
file source files with CSV format (one or more)
optional arguments:
-h, --help show this help message and exit
-d string description comment to be embedded on dictionary
-o file output file (default: user.dic)
-s file system dictionary (default: linked system_dic, see link -h)
For the dictionary file format, please refer to this document (written in Japanese; an English version is not available yet).
$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]
Build Sudachi Dictionary
positional arguments:
file source files with CSV format (one of more)
optional arguments:
-h, --help show this help message and exit
-o file output file (default: system.dic)
-d string description comment to be embedded on dictionary
required named arguments:
-m file connection matrix file with MeCab's matrix.def format
To use your customized system.dic, place sudachi.json anywhere you like, and overwrite the systemDict value with the relative path from sudachi.json to your system.dic.
{
"systemDict" : "relative/path/to/system.dic",
...
}
Then specify your sudachi.json with the -r option.
$ sudachipy -r path/to/sudachi.json
Run scripts/format.sh to check if your code is formatted correctly.
You need the packages flake8, flake8-import-order, and flake8-builtins (see requirements.txt).
Run scripts/test.sh to run the tests.
Sudachi and SudachiPy are developed by WAP Tokushima Laboratory of AI and NLP.
Open an issue, or come to our Slack workspace for questions and discussion.
https://sudachi-dev.slack.com/ (Get invitation here)
Enjoy tokenization!