/SudachiPy

Python version of Sudachi, a Japanese tokenizer.

Primary LanguagePythonApache License 2.0Apache-2.0

SudachiPy

PyPi version Build Status

日本語

SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.

TL;DR

$ pip install sudachipy sudachidict_core

$ echo "高輪ゲートウェイ駅" | sudachipy
高輪ゲートウェイ駅	名詞,固有名詞,一般,*,*,*	高輪ゲートウェイ駅
EOS

$ echo "高輪ゲートウェイ駅" | sudachipy -m A
高輪	名詞,固有名詞,地名,一般,*,*	高輪
ゲートウェイ	名詞,普通名詞,一般,*,*,*	ゲートウェー
駅	名詞,普通名詞,一般,*,*,*	駅
EOS

$ echo "空缶空罐空きカン" | sudachipy -a
空缶	名詞,普通名詞,一般,*,*,*	空き缶	空缶	アキカン	0
空罐	名詞,普通名詞,一般,*,*,*	空き缶	空罐	アキカン	0
空きカン	名詞,普通名詞,一般,*,*,*	空き缶	空きカン	アキカン	0
EOS

Setup

You need SudachiPy and a dictionary.

Step 1. Install SudachiPy

$ pip install sudachipy

Step 2. Get a Dictionary

You can get dictionary as a Python package. It make take a while to download the dictionary file (around 70MB for the core edition).

$ pip install sudachidict_core

Alternatively, you can choose other dictionary editions. See this section for the detail.

Usage: As a command

There is a CLI command sudachipy.

$ echo "外国人参政権" | sudachipy
外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権
EOS
$ echo "外国人参政権" | sudachipy -m A
外国	名詞,普通名詞,一般,*,*,*	外国
人	接尾辞,名詞的,一般,*,*,*	人
参政	名詞,普通名詞,一般,*,*,*	参政
権	接尾辞,名詞的,一般,*,*,*	権
EOS
$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-s string]
                          [-a] [-d] [-v]
                          [file [file ...]]

Tokenize Text

positional arguments:
  file           text written in utf-8

optional arguments:
  -h, --help     show this help message and exit
  -r file        the setting file in JSON format
  -m {A,B,C}     the mode of splitting
  -o file        the output file
  -s string      sudachidict type
  -a             print all of the fields
  -d             print the debug information
  -v, --version  print sudachipy version

Output

Columns are tab separated.

  • Surface
  • Part-of-Speech Tags (comma separated)
  • Normalized Form

When you add the -a option, it additionally outputs

  • Dictionary Form
  • Reading Form
  • Dictionary ID
    • 0 for the system dictionary
    • 1 and above for the user dictionaries
    • -1\t(OOV) if a word is Out-of-Vocabulary (not in the dictionary)
$ echo "外国人参政権" | sudachipy -a
外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権	外国人参政権	ガイコクジンサンセイケン	0
EOS
echo "阿quei" | sudachipy -a
阿	名詞,普通名詞,一般,*,*,*	阿	阿		-1	(OOV)
quei	名詞,普通名詞,一般,*,*,*	quei	quei		-1	(OOV)
EOS

Usage: As a Python package

Here is an example;

from sudachipy import tokenizer
from sudachipy import dictionary

tokenizer_obj = dictionary.Dictionary().create()
# Multi-granular Tokenization

mode = tokenizer.Tokenizer.SplitMode.C
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家公務員']

mode = tokenizer.Tokenizer.SplitMode.B
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務員']

mode = tokenizer.Tokenizer.SplitMode.A
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", mode)]
# => ['国家', '公務', '員']
# Morpheme information

m = tokenizer_obj.tokenize("食べ", mode)[0]

m.surface() # => '食べ'
m.dictionary_form() # => '食べる'
m.reading_form() # => 'タベ'
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']
# Normalization

tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'

(With 20200330 core dictionary. The results may change when you use other versions)

Dictionary Edition

**WARNING: sudachipy link is no longer available in SudachiPy v0.5.2 and later. **

There are three editions of Sudachi Dictionary, namely, small, core, and full. See WorksApplications/SudachiDict for the detail.

SudachiPy uses sudachidict_core by default.

Dictionaries are installed as Python packages sudachidict_small, sudachidict_core, and sudachidict_full.

The dictionary files are not in the package itself, but it is downloaded upon installation.

Dictionary option: command line

You can specify the dictionary with the tokenize option -s.

$ pip install sudachidict_small
$ echo "外国人参政権" | sudachipy -s small
$ pip install sudachidict_full
$ echo "外国人参政権" | sudachipy -s full

Dictionary option: Python package

You can specify the dictionary with the Dicionary() argument; config_path or dict_type.

class Dictionary(config_path=None, resource_dir=None, dict_type=None)
  1. config_path
    • You can specify the file path to the setting file with config_path (See [Dictionary in The Setting File](#Dictionary in The Setting File) for the detail).
    • If the dictionary file is specified in the setting file as systemDict, SudachiPy will use the dictionary.
  2. dict_type
    • You can also specify the dictionary type with dict_type.
    • The available arguments are small, core, or full.
    • If different dictionaries are specified with config_path and dict_type, a dictionary defined dict_type overrides those defined in the config path.
from sudachipy import tokenizer
from sudachipy import dictionary

# default: sudachidict_core
tokenizer_obj = dictionary.Dictionary().create()  

# The dictionary given by the `systemDict` key in the config file (/path/to/sudachi.json) will be used
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json").create()  

# The dictionary specified by `dict_type` will be set.
tokenizer_obj = dictionary.Dictionary(dict_type="core").create()  # sudachidict_core (same as default)
tokenizer_obj = dictionary.Dictionary(dict_type="small").create()  # sudachidict_small
tokenizer_obj = dictionary.Dictionary(dict_type="full").create()  # sudachidict_full

# The dictionary specified by `dict_type` overrides those defined in the config path.
# In the following code, `sudachidict_full` will be used regardless of a dictionary defined in the config file. 
tokenizer_obj = dictionary.Dictionary(config_path="/path/to/sudachi.json", dict_type="full").create()  

Dictionary in The Setting File

Alternatively, if the dictionary file is specified in the setting file, sudachi.json, SudachiPy will use that file.

{
    "systemDict" : "relative/path/to/system.dic",
    ...
}

The default setting file is sudachipy/resources/sudachi.json. You can specify your sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json

User Dictionary

To use a user dictionary, user.dic, place sudachi.json to anywhere you like, and add userDict value with the relative path from sudachi.json to your user.dic.

{
    "userDict" : ["relative/path/to/user.dic"],
    ...
}

Then specify your sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json

You can build a user dictionary with the subcommand ubuild.

WARNING: v0.3.* ubuild contains bug.

$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-d string] [-o file] [-s file] file [file ...]

Build User Dictionary

positional arguments:
  file        source files with CSV format (one or more)

optional arguments:
  -h, --help  show this help message and exit
  -d string   description comment to be embedded on dictionary
  -o file     output file (default: user.dic)
  -s file     system dictionary path (default: system core dictionary path)

About the dictionary file format, please refer to this document (written in Japanese, English version is not available yet).

Customized System Dictionary

$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]

Build Sudachi Dictionary

positional arguments:
  file        source files with CSV format (one of more)

optional arguments:
  -h, --help  show this help message and exit
  -o file     output file (default: system.dic)
  -d string   description comment to be embedded on dictionary

required named arguments:
  -m file     connection matrix file with MeCab's matrix.def format

To use your customized system.dic, place sudachi.json to anywhere you like, and overwrite systemDict value with the relative path from sudachi.json to your system.dic.

{
    "systemDict" : "relative/path/to/system.dic",
    ...
}

Then specify your sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json

For Developers

Cython Build

$ python setup.py build_ext --inplace

Code Format

Run scripts/format.sh to check if your code is formatted correctly.

You need packages flake8 flake8-import-order flake8-buitins (See requirements.txt).

Test

Run scripts/test.sh to run the tests.

Contact

Sudachi and SudachiPy are developed by WAP Tokushima Laboratory of AI and NLP.

Open an issue, or come to our Slack workspace for questions and discussion.

https://sudachi-dev.slack.com/ (Get invitation here)

Enjoy tokenization!