

Japanese Tokenizer Microservice

A microservice that makes it easy to tokenize Japanese texts. Useful for making word clouds, analyzing texts, or getting the most important words from a sentence.

It uses Flask (a lightweight Python web framework) and MeCab for text segmentation.

POST /important_words
{
  "text": "プログラミング言語は、その開発の背景や機能などの影響を受け、言語によって得意とする分野は異なります。"
}

# Result:

{
  "result": [
    "プログラミング", "言語", "開発", "背景",
    "機能", "影響", "受ける", "言語", "得意",
    "する", "分野", "異なる"
  ]
}
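
The example result is consistent with keeping only nouns, verbs, and adjectives, each in its dictionary (base) form. The sketch below shows that kind of filter on pre-analyzed tokens. It is an illustration only: the `CONTENT_POS` set and the `pick_important` helper are assumptions inferred from the example output, not the service's actual code.

```python
# Parts of speech treated as "important" here. This set is an assumption
# inferred from the example output, not taken from the service's source.
CONTENT_POS = {"名詞", "動詞", "形容詞"}  # noun, verb, adjective


def pick_important(tokens):
    """Return the base forms of tokens whose part of speech is in CONTENT_POS.

    `tokens` is an iterable of (base_form, part_of_speech) pairs, for example
    produced by a morphological analyzer such as MeCab.
    """
    return [base for base, pos in tokens if pos in CONTENT_POS]


# Hand-written analysis of "猫が可愛かった", for illustration.
tokens = [
    ("猫", "名詞"),        # noun, kept
    ("が", "助詞"),        # particle, dropped
    ("可愛い", "形容詞"),  # adjective, base form of the inflected "可愛かっ"
    ("た", "助動詞"),      # auxiliary verb, dropped
]
important = pick_important(tokens)
print(important)
```

Particles and auxiliary verbs carry little meaning on their own, which is why a filter like this drops them.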

Adding ?metadata=true returns each word's part of speech ("type") and reading along with the word itself:

POST /important_words?metadata=true
{
  "text": "猫が可愛い"
}

# Result:

{
  "result": [
    {"word": "猫", "type": "名詞", "reading": "ネコ"},
    {"word": "可愛い", "type": "形容詞", "reading": "カワイイ"}
  ]
}
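
If you would rather call the endpoint from Python, here is a minimal sketch using only the standard library. The helper names `build_request` and `important_words` are made up for this example, and `BASE_URL` assumes the service is running locally on port 45678 as in the run instructions; adjust it to your setup.

```python
import json
import urllib.parse
import urllib.request

# Assumed address; matches the port used in the run instructions.
BASE_URL = "http://localhost:45678"


def build_request(text, metadata=False, base_url=BASE_URL):
    """Build (but do not send) a POST request for /important_words."""
    url = base_url + "/important_words"
    if metadata:
        url += "?" + urllib.parse.urlencode({"metadata": "true"})
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def important_words(text, metadata=False, base_url=BASE_URL):
    """Send the request and return the "result" list from the JSON response."""
    request = build_request(text, metadata=metadata, base_url=base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["result"]
```

With the server running, `important_words("猫が可愛い", metadata=True)` should return the list shown in the result above.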


Requires Python 3 (created using 3.8.10) and pip3.

Install dependencies:

pip3 install -r requirements.txt

Install MeCab and its dictionaries. There are several ways to do this; here is one guide that can help: https://qiita.com/ekzemplaro/items/c98c7f6698f130b55d53

Test mecab by executing this command:

echo "辞書" | mecab

And the result should be:

辞書    名詞,一般,*,*,*,*,辞書,ジショ,ジショ
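
To read that line: MeCab prints the surface form, a tab, and then a comma-separated feature list. With an IPADIC-style dictionary the fields are part of speech, three sub-categories, conjugation type, conjugation form, base form, reading, and pronunciation. The sketch below splits such a line; `parse_mecab_line` is a hypothetical helper for illustration, and other dictionaries (for example UniDic) use a different field layout.

```python
def parse_mecab_line(line):
    """Split one MeCab output line into a small dict.

    Assumes the default output format with an IPADIC-style dictionary:
    surface<TAB>pos,sub1,sub2,sub3,conj_type,conj_form,base,reading,pronunciation
    """
    surface, _, feature_text = line.rstrip("\n").partition("\t")
    features = feature_text.split(",")
    # Unknown words may come with fewer fields, so pad with "*".
    features += ["*"] * (9 - len(features))
    return {
        "surface": surface,
        "type": features[0],
        # Fall back to the surface form when no base form is given.
        "base": features[6] if features[6] != "*" else surface,
        "reading": features[7],
    }


entry = parse_mecab_line("辞書\t名詞,一般,*,*,*,*,辞書,ジショ,ジショ")
print(entry)
```

If the part of speech or reading comes out as "*" for ordinary words, the dictionary is probably missing or not the one you expected.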

Run the app:

export FLASK_APP=server.py
export FLASK_RUN_PORT=45678 # Your desired port.
export FLASK_ENV=development # Removed in Flask 2.3; use FLASK_DEBUG=1 there instead.
flask run

And test using:

curl -X POST http://localhost:45678/important_words -d "{\"text\": \"猫が可愛かった\"}" -H "Content-Type: application/json"

Result should be:

{
  "result": ["猫", "可愛い"]
}

Possible Errors

Cannot find mecab path

Sometimes the app fails because it cannot find MeCab's configuration file (mecabrc). Locate the file and point the MECABRC environment variable at it:

# Returns its path, in my case /etc/mecabrc
sudo find / -iname mecabrc

# Set environment variable.
export MECABRC='/etc/mecabrc'

# Run app.
flask run
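
If you prefer not to search by hand, the same check can be sketched in Python. The candidate paths below are assumptions about common install locations and will not cover every system; `find_mecabrc` and `ensure_mecabrc` are hypothetical helpers, not part of this project.

```python
import os

# Common install locations; these paths are assumptions and vary by system.
CANDIDATE_PATHS = [
    "/etc/mecabrc",
    "/usr/local/etc/mecabrc",
    "/opt/homebrew/etc/mecabrc",
]


def find_mecabrc(candidates=CANDIDATE_PATHS):
    """Return the first existing path among `candidates`, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def ensure_mecabrc(candidates=CANDIDATE_PATHS):
    """Set MECABRC if it is unset and a candidate file exists.

    Returns the value of MECABRC afterwards, or None if it is still unset.
    """
    if "MECABRC" not in os.environ:
        path = find_mecabrc(candidates)
        if path is not None:
            os.environ["MECABRC"] = path
    return os.environ.get("MECABRC")
```

Calling `ensure_mecabrc()` before MeCab is first used would have the same effect as the export command, but only for that Python process.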