jp_tokenizer

This repository contains a tiny web service that lets you tokenize and lemmatize Japanese text.

The service is implemented by wrapping the MeCab tokenizer (paper) in a Sanic app.
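For reference, the wiring looks roughly like the sketch below. This is not the repository's actual code: the MeCab options, the lemma-extraction logic, and the app setup are assumptions; only the /tokenize and /lemmatize routes come from the API described here.

# minimal sketch of wrapping MeCab in a Sanic app (illustrative, not the repo's source)
import MeCab
from sanic import Sanic
from sanic.response import text as plain_text

app = Sanic("jp_tokenizer")
wakati = MeCab.Tagger("-Owakati")   # space-delimited surface-form output
tagger = MeCab.Tagger()             # full morphological analysis

@app.post("/tokenize")
async def tokenize(request):
    body = request.body.decode("utf-8")
    # "-Owakati" already yields a space-delimited token string
    return plain_text(wakati.parse(body).strip())

@app.post("/lemmatize")
async def lemmatize(request):
    body = request.body.decode("utf-8")
    lemmas = []
    node = tagger.parseToNode(body)
    while node:
        if node.surface:  # skip BOS/EOS nodes
            features = node.feature.split(",")
            # with an IPAdic-style dictionary, field 6 holds the base (dictionary) form
            lemma = features[6] if len(features) > 6 and features[6] != "*" else node.surface
            lemmas.append(lemma)
        node = node.next
    return plain_text(" ".join(lemmas))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)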

Usage

Ensure that your server has roughly 2-3 GB of available RAM (an Azure Standard DS1_v2, for example), then run:

# start a container for the service and its dependencies
docker run -p 8080:80 cwolff/jp_tokenizer

# call the API
curl -X POST 'http://localhost:8080/tokenize' --data 'サザエさんは走った'
curl -X POST 'http://localhost:8080/lemmatize' --data 'サザエさんは走った'

The API responds with a space-delimited string of tokens or lemmas, respectively.
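The endpoints can also be called from any HTTP client. The snippet below is an illustrative Python example using the requests library, with the host and port taken from the docker run command above; the outputs shown in the comments are only indicative, since the exact segmentation and lemmas depend on the MeCab dictionary inside the image.

# example client call (illustrative)
import requests

BASE = "http://localhost:8080"
text = "サザエさんは走った"

tokens = requests.post(f"{BASE}/tokenize", data=text.encode("utf-8")).text
lemmas = requests.post(f"{BASE}/lemmatize", data=text.encode("utf-8")).text

print(tokens.split())  # e.g. ['サザエ', 'さん', 'は', '走っ', 'た']
print(lemmas.split())  # e.g. ['サザエ', 'さん', 'は', '走る', 'た']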