🎤 vibrato: VIterbi-Based acceleRAted TOkenizer

Vibrato is a fast implementation of tokenization (or morphological analysis) based on the Viterbi algorithm.
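As a rough illustration of what Viterbi-based tokenization means, the sketch below builds a small lattice of dictionary matches and picks the path with the lowest total cost (word costs plus connection costs between adjacent words). It is not Vibrato's implementation: the names `Entry`, `Node`, `toy_lexicon`, `toy_conn`, and `tokenize`, the context ids, and all cost values are hypothetical and chosen only for this example.

```rust
/// Context id shared by the beginning and end of a sentence (an assumption of this sketch).
const BOS_EOS: usize = 0;

/// A toy dictionary entry; ids and costs are made up for illustration.
#[derive(Clone, Copy)]
struct Entry {
    surface: &'static str,
    left_id: usize,
    right_id: usize,
    cost: i32,
}

/// A lattice node: the best known path that ends with this entry.
#[derive(Clone, Copy)]
struct Node {
    start: usize,        // start position in characters
    entry: usize,        // index into the lexicon
    total: i32,          // minimum cost from the sentence start through this node
    prev: Option<usize>, // index of the best predecessor among nodes ending at `start`
}

/// Hypothetical lexicon: id 1 stands for nouns, id 2 for particles.
fn toy_lexicon() -> Vec<Entry> {
    vec![
        Entry { surface: "神保", left_id: 1, right_id: 1, cost: 500 },
        Entry { surface: "町", left_id: 1, right_id: 1, cost: 400 },
        Entry { surface: "神保町", left_id: 1, right_id: 1, cost: 700 },
        Entry { surface: "へ", left_id: 2, right_id: 2, cost: 100 },
    ]
}

/// Hypothetical connection costs, indexed as conn[right_id of previous][left_id of next].
fn toy_conn() -> Vec<Vec<i32>> {
    vec![vec![0, 0, 200], vec![0, 100, 0], vec![0, 0, 300]]
}

/// Returns the minimum-cost segmentation, or None if no path covers the text
/// (which includes empty input). Entries are assumed to have non-empty surfaces.
fn tokenize(text: &str, lexicon: &[Entry], conn: &[Vec<i32>]) -> Option<Vec<&'static str>> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    // ends[p] holds every node whose surface ends at character position p.
    let mut ends: Vec<Vec<Node>> = vec![Vec::new(); n + 1];
    for start in 0..n {
        // Skip positions that no path reaches.
        if start > 0 && ends[start].is_empty() {
            continue;
        }
        for (ei, e) in lexicon.iter().enumerate() {
            let len = e.surface.chars().count();
            if start + len > n
                || !chars[start..start + len].iter().copied().eq(e.surface.chars())
            {
                continue;
            }
            // Choose the cheapest predecessor (the Viterbi step).
            let (prev, base) = if start == 0 {
                (None, conn[BOS_EOS][e.left_id])
            } else {
                let (i, c) = ends[start]
                    .iter()
                    .enumerate()
                    .map(|(i, p)| (i, p.total + conn[lexicon[p.entry].right_id][e.left_id]))
                    .min_by_key(|&(_, c)| c)?;
                (Some(i), c)
            };
            ends[start + len].push(Node { start, entry: ei, total: base + e.cost, prev });
        }
    }
    // Connect to the end of the sentence and pick the cheapest final node.
    let (mut idx, _) = ends[n]
        .iter()
        .enumerate()
        .map(|(i, p)| (i, p.total + conn[lexicon[p.entry].right_id][BOS_EOS]))
        .min_by_key(|&(_, c)| c)?;
    // Follow the back-pointers to recover the tokens.
    let mut pos = n;
    let mut out = Vec::new();
    loop {
        let node = ends[pos][idx];
        out.push(lexicon[node.entry].surface);
        match node.prev {
            Some(p) => {
                pos = node.start;
                idx = p;
            }
            None => break,
        }
    }
    out.reverse();
    Some(out)
}
```

With these toy costs, `tokenize("神保町へ", &toy_lexicon(), &toy_conn())` returns `Some(["神保町", "へ"])`: that path costs 800 in total, while the alternative 神保 / 町 / へ costs 1100.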

Features

Fast tokenization

Vibrato is a Rust reimplementation of the fast tokenizer MeCab, with the implementation simplified and optimized for even faster tokenization. It is especially fast for language resources with a large matrix (e.g., unidic-cwj-3.1.0, whose matrix is 459 MiB), thanks to cache-efficient id mappings.

For example, the following figure shows experimental results on tokenization time for MeCab and its reimplementations. The detailed experimental settings and other results are available on the Wiki.

MeCab compatibility

Vibrato supports options for outputting tokenized results identical to MeCab, such as ignoring whitespace.

Basic usage

This software is implemented in Rust. First of all, install rustc and cargo following the official instructions.

1. Resource preparation

You can compile a system dictionary from language resources in the MeCab format. The simplest way is using publicly-available resources such as IPADIC or UniDic.

The scripts directory provides scripts that prepare system dictionaries from several public resources.

$ ls -1 scripts
prepare_ipadic-mecab-2_7_0.sh
prepare_ipadic-mecab-neologd-20200910.sh
prepare_unidic-cwj-3_1_0.sh
prepare_unidic-mecab-2_1_2.sh

For example, if you want to use mecab-ipadic v2.7.0, run prepare_ipadic-mecab-2_7_0.sh.

$ ./scripts/prepare_ipadic-mecab-2_7_0.sh

The system dictionary resources_ipadic-mecab-2_7_0/system.dic will be produced.

$ ls resources_ipadic-mecab-2_7_0
system.dic

See the document for preparation steps without these scripts.

2. Tokenization

To tokenize sentences using the system dictionary, run the following command.

$ echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i resources_ipadic-mecab-2_7_0/system.dic

The resulting tokens are output in the MeCab format.

本	名詞,一般,*,*,*,*,本,ホン,ホン
と	助詞,並立助詞,*,*,*,*,と,ト,ト
カレー	名詞,固有名詞,地域,一般,*,*,カレー,カレー,カレー
の	助詞,連体化,*,*,*,*,の,ノ,ノ
街	名詞,一般,*,*,*,*,街,マチ,マチ
神保	名詞,固有名詞,地域,一般,*,*,神保,ジンボウ,ジンボー
町	名詞,接尾,地域,*,*,*,町,マチ,マチ
へ	助詞,格助詞,一般,*,*,*,へ,ヘ,エ
ようこそ	感動詞,*,*,*,*,*,ようこそ,ヨウコソ,ヨーコソ
。	記号,句点,*,*,*,*,。,。,。
EOS

If you want to output tokens separated by spaces, specify -O wakati.

$ echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i resources_ipadic-mecab-2_7_0/system.dic -O wakati
本 と カレー の 街 神保 町 へ ようこそ 。
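If you post-process the tokenizer output in your own program, the two formats shown above are easy to handle. The following is a minimal sketch, assuming each token line is `surface<TAB>feature,feature,...` and each sentence ends with an `EOS` line, as in the examples. `Token`, `parse_mecab_output`, and `to_wakati` are hypothetical helper names, not part of Vibrato, and the naive comma split does not handle quoted feature fields.

```rust
/// One token parsed from MeCab-format output (helper type for this sketch).
#[derive(Debug, PartialEq)]
struct Token {
    surface: String,
    features: Vec<String>,
}

/// Parses MeCab-format text into sentences of tokens.
/// Lines without a tab (other than `EOS`) are skipped.
fn parse_mecab_output(output: &str) -> Vec<Vec<Token>> {
    let mut sentences = Vec::new();
    let mut current = Vec::new();
    for line in output.lines() {
        if line == "EOS" {
            // Close the current sentence and start a new one.
            sentences.push(std::mem::take(&mut current));
            continue;
        }
        // Split at the first tab only, so the surface may contain spaces.
        if let Some((surface, features)) = line.split_once('\t') {
            current.push(Token {
                surface: surface.to_string(),
                features: features.split(',').map(str::to_string).collect(),
            });
        }
    }
    sentences
}

/// Joins the surfaces of one sentence with single spaces (the wakati style).
fn to_wakati(tokens: &[Token]) -> String {
    tokens
        .iter()
        .map(|t| t.surface.as_str())
        .collect::<Vec<_>>()
        .join(" ")
}
```

Tokens that arrive after the last `EOS` line are dropped by this sketch, so it expects complete sentences.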

MeCab-compatible options

Vibrato is a reimplementation of the MeCab algorithm, but with the default settings it can produce different tokens from MeCab.

For example, MeCab ignores spaces (more precisely, SPACE defined in char.def) in tokenization.

$ echo "mens second bag" | mecab
mens	名詞,固有名詞,組織,*,*,*,*
second	名詞,一般,*,*,*,*,*
bag	名詞,固有名詞,組織,*,*,*,*
EOS

However, Vibrato handles such spaces as tokens with the default settings.

$ echo 'mens second bag' | cargo run --release -p tokenize -- -i resources_ipadic-mecab-2_7_0/system.dic
mens	名詞,固有名詞,組織,*,*,*,*
 	記号,空白,*,*,*,*,*
second	名詞,固有名詞,組織,*,*,*,*
 	記号,空白,*,*,*,*,*
bag	名詞,固有名詞,組織,*,*,*,*
EOS

If you want to obtain the same results as MeCab, specify the arguments -S and -M 24.

$ echo 'mens second bag' | cargo run --release -p tokenize -- -i resources_ipadic-mecab-2_7_0/system.dic -S -M 24
mens	名詞,固有名詞,組織,*,*,*,*
second	名詞,一般,*,*,*,*,*
bag	名詞,固有名詞,組織,*,*,*,*
EOS

-S makes the tokenizer ignore spaces. -M sets the maximum grouping length for unknown words.
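To check whether two runs agree, you can compare their outputs token by token. The sketch below is one possible way to do that, assuming both outputs are in the MeCab format shown above. `token_lines` and `first_mismatch` are hypothetical helper names introduced for this example.

```rust
/// Collects the token lines of an output, dropping `EOS` markers and empty lines.
fn token_lines(output: &str) -> Vec<&str> {
    output
        .lines()
        .filter(|l| *l != "EOS" && !l.is_empty())
        .collect()
}

/// Returns the index of the first token line that differs between two outputs,
/// or None if both outputs contain exactly the same token lines.
fn first_mismatch(a: &str, b: &str) -> Option<usize> {
    let (ta, tb) = (token_lines(a), token_lines(b));
    match ta.iter().zip(tb.iter()).position(|(x, y)| x != y) {
        Some(i) => Some(i),
        // One output is a strict prefix of the other.
        None if ta.len() != tb.len() => Some(ta.len().min(tb.len())),
        None => None,
    }
}
```

Applied to the MeCab output and the default Vibrato output above, `first_mismatch` reports index 1, where Vibrato emits the space token. Applied to the MeCab output and the `-S -M 24` output, it returns `None`.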

Notes

There are corner cases where the tokenization results differ because of cost tiebreakers. However, this is not an essential problem.

User dictionary

You can use your own user dictionary along with the system dictionary. The user dictionary must be in the following CSV format.

<surface>,<left-id>,<right-id>,<cost>,<features...>

The first four columns are always required. The others (i.e., <features...>) are optional.

For example,

$ cat user.csv
神保町,1293,1293,334,カスタム名詞,ジンボチョウ
本とカレーの街,1293,1293,0,カスタム名詞,ホントカレーノマチ
ようこそ,3,3,-1000,感動詞,ヨーコソ,Welcome,欢迎欢迎,Benvenuto,Willkommen
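Before passing a user dictionary to the tokenizer, it can help to check that every line has the four required columns. The following is a minimal validation sketch, not Vibrato's own parser: `UserEntry` and `parse_user_entry` are hypothetical names, the integer widths are assumptions, and the plain comma split does not support quoted fields.

```rust
/// One user dictionary line (helper type for this sketch; integer widths are assumed).
#[derive(Debug, PartialEq)]
struct UserEntry {
    surface: String,
    left_id: u32,
    right_id: u32,
    cost: i32,
    features: Vec<String>,
}

/// Parses `<surface>,<left-id>,<right-id>,<cost>,<features...>`.
/// The first four columns are required; any further columns are kept as features.
fn parse_user_entry(line: &str) -> Result<UserEntry, String> {
    let cols: Vec<&str> = line.split(',').collect();
    if cols.len() < 4 {
        return Err(format!("expected at least 4 columns, found {}", cols.len()));
    }
    if cols[0].is_empty() {
        return Err("surface must not be empty".to_string());
    }
    let left_id = cols[1]
        .parse::<u32>()
        .map_err(|e| format!("invalid left-id {:?}: {}", cols[1], e))?;
    let right_id = cols[2]
        .parse::<u32>()
        .map_err(|e| format!("invalid right-id {:?}: {}", cols[2], e))?;
    let cost = cols[3]
        .parse::<i32>()
        .map_err(|e| format!("invalid cost {:?}: {}", cols[3], e))?;
    Ok(UserEntry {
        surface: cols[0].to_string(),
        left_id,
        right_id,
        cost,
        // Empty when the line has exactly four columns.
        features: cols[4..].iter().map(|s| s.to_string()).collect(),
    })
}
```

Negative costs such as `-1000` in the example above parse normally, since the cost column is read as a signed integer.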

To use the user dictionary, specify the file with the -u argument.

$ echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i resources_ipadic-mecab-2_7_0/system.dic -u user.csv
本とカレーの街	カスタム名詞,ホントカレーノマチ
神保町	カスタム名詞,ジンボチョウ
へ	助詞,格助詞,一般,*,*,*,へ,ヘ,エ
ようこそ	感動詞,ヨーコソ,Welcome,欢迎欢迎,Benvenuto,Willkommen
。	記号,句点,*,*,*,*,。,。,。
EOS

Benchmark

You can measure the tokenization speed for sentences in test.txt.

If you can guarantee that system.dic is exported from this library, you can specify --features=unchecked for faster tokenization.

$ cargo run --release -p benchmark --features=unchecked -- -i resources_ipadic-mecab-2_7_0/system.dic < test.txt
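If you want to time tokenization inside your own program instead, the general shape of such a measurement can be sketched as follows. This is not the code of the benchmark tool: `BenchResult` and `bench` are hypothetical names, and the tokenizer is passed in as a closure that returns the number of tokens it produced, so any implementation can be plugged in.

```rust
use std::time::{Duration, Instant};

/// Summary of one measurement (helper type for this sketch).
struct BenchResult {
    sentences: usize,
    tokens: usize,
    elapsed: Duration,
}

/// Runs `tokenize` over all lines `runs` times and records the wall-clock time.
/// `tokenize` must return the number of tokens found in the given line.
fn bench<F>(lines: &[&str], runs: usize, mut tokenize: F) -> BenchResult
where
    F: FnMut(&str) -> usize,
{
    let mut tokens = 0;
    let start = Instant::now();
    for _ in 0..runs {
        for &line in lines {
            // Accumulating the count keeps the work from being optimized away.
            tokens += tokenize(line);
        }
    }
    BenchResult {
        sentences: lines.len() * runs,
        tokens,
        elapsed: start.elapsed(),
    }
}
```

For a quick smoke test you can pass a placeholder such as `|s| s.split_whitespace().count()`. Dividing `elapsed` by `sentences` then gives the average time per sentence. Build with `--release` when measuring, as in the command above.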

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.