/stakk

STAtistical Kana Kanji conversion engine; predictive input method; spelling correction; morphological analysis.

Primary LanguageC++

StaKK: Statistical Kana Kanji conversion engine

StaKK is Japanese Language Processer with following features.

* Current Features
Kana-Kanji converter
Predictive Input or Query Suggestion
Spelling Correction
Morphological Analyzer
HTTP API Server
Raw Trie Operations

Currently, StaKK uses dictionary of Mozc, OSS version of Google Japanese IME.

* Reverse Mode
StaKK implements two mode: normal mode and reverse mode.

Normal Mode: Input Reading or Kana, Output Word or Kanji.
Reverse Mode:   Input Word or Kanji, Output Reading or POS tag etc.

These two methods can be applied for the purpos in the following table.

| Method        | Normal Mode           | Reverse Mode          |
| Convert       | Kana-Kanji Conversion | Morphological Analyze |
| Predict       | Predictive Input      | Query Suggestion      |
| Spell Correct | Correct with Reading  | Correct with Surface  |

* Usage
Common Options:
    command [-r] [-d dictionary] [-c conneciton]
        -d dictionary:  mozc format dictionary file path. default: data/dictinoary.txt
        -c connection:  mozc format connection file path. default: data/connection.txt
        -i id.def:  mozc format class id definition file path. default: data/id.def
                This option is not enable for stakk_test command.
        -r:     "Reverse" option for morphological analyzer.
                If reverse option is set, input word surface instead of reading.

Command Specific Options:
    converter_test [-i id.def] [-o mecab|wakati|yomi]
        Execute Kana-Kanji Conversion or Morphological Analyze on Command Line.
        -i id.def:  Mozc format class id definition file path. default: data/id.def
        -o output:  Output format: one of "mecab", "wakati", "yomi". default: mecab

    stakk_test [-m predict|spell] [-t threshold]
        Execute Prediction or Spelling Correction on Command Line.
        -m mode:        Retrieval mode: one of "predict", "spell". default: spell
        -t threshold:   Spelling correction threhsold of edit distance. default: 1

    stakk_server [-i id.def] [-p port]
        Start HTTP API Server.
        -p port:    Port number of HTTP API server.

HTTP API Server Path:
    General Format
        /apiname/method/query[options]
        
        apiname:    Any name is acceptable. recommend: api
        method:     One of "convert", "predict", "spell".
        query:      Input Query of Japanese String.

    Convert Method
        /apiname/convert/query/format

        format:     One of "wakati", "mecab", "yomi". default: wakati
    
    Predict Method
        /apiname/predict/query/number

        number:     Maximum Display Number of Candidates. default: 50

    Spell Method
        /apiname/predict/query/threshold/number

        threshold:  Threshold of Edit Distance. default: 1
                    DON'T set this larger than 2, which will down the server.
        number:     Maximum Display Number of Candidates. default: 50

    

* Developer Environment
CentOS 5.5
gcc 4.1.2

* Version
v1.0 2010/11/23 first release.

* Examples
# Compilation
$ make

# Kana-Kanji Conversion
$ ./converter_test -o wakati
loading dictionary
loading connection
loading id definition
input query:
わたしのなまえはなかのです。
私 の 名前 は 中野 です 。

# Predictive Input
$ ./stakk_test -m predict
loading dictionary
loading connection
input query:
あり    (Input)
ありがとう      ありがとう
ありがとう      有難う
ありがと        ありがと
ありあんつかさいかいじょうほけん        アリアンツ火災海上保険
ありえってぃ    アリエッティ
etc.

# Spelling Correction
$ ./stakk_test -m spell
loading dictionary
loading connection
input query:
れみおめろん    (Input)
れみおろめん    レミオロメン
あみおだろん    アミオダロン
ぐーgる    (Input)
ぐーぐる        グーグル
ぐーる  グール
etc.

# Morphological Analyzer
$ ./converter_test -r -o mecab
loading dictionary
loading connection
loading id definition
input query:
東京都に住む
東京都  とうきょうと    名詞,固有名詞,地域,一般,*,*,都名        名詞,接尾,地域,*,*,*,*
に      に      助詞,格助詞,一般,*,*,*,に       助詞,格助詞,一般,*,*,*,に
住む    すむ    動詞,自立,*,*,五段動詞,基本形,5 動詞,自立,*,*,五段動詞,基本形,5
EOS

# HTTP API Server
$ ./stakk_server -p 50000 &
loading dictionary
loading connection
loading id definition
server ready
$ curl "http://localhost:50000/api/convert/わたしのなまえはなかのです。"
私 の 名前 は 中野 です 。
$ curl "http://localhost:50000/api/predict/きょうの"
きょうのうんせい        今日の運勢
きょうのてんき  今日の天気
きょうのばんぐみ        今日の番組
きょうのひとこと        今日の一言
etc.
$ curl "http://localhost:50000/api/spell/れみおめろん"
れみおろめん    レミオロメン
$ curl "http://localhost:50000/api/spell/れみおめろん/2"
れみおろめん    レミオロメン
あみおだろん    アミオダロン

# HTTP API Server with Reverse Mode
$ ./stakk_server -rp 50001 &
loading dictionary
loading connection
loading id definition
server ready
$ curl "http://localhost:50001/api/convert/東京都に住む/mecab"
東京都  とうきょうと    名詞,固有名詞,地域,一般,*,*,都名        名詞,接尾,地域,*,*,*,*
に      に      助詞,格助詞,一般,*,*,*,に       助詞,格助詞,一般,*,*,*,に
住む    すむ    動詞,自立,*,*,五段動詞,基本形,5 動詞,自立,*,*,五段動詞,基本形,5
EOS
$ curl "http://localhost:50001/api/spell/テソション"
てんしょん      テンション
ていしょん      テイション
てーしょん      テーション