
Sudachi for Transformers (chiTra)


chiTra (Sudachi for Transformers) is a Japanese tokenizer for Transformers, built on the Sudachi morphological analyzer.

Quick Tour

>>> from transformers import BertModel
>>> from sudachitra import BertSudachipyTokenizer

>>> tokenizer = BertSudachipyTokenizer.from_pretrained('sudachitra-bert-base-japanese-sudachi')
>>> model = BertModel.from_pretrained('sudachitra-bert-base-japanese-sudachi')
>>> model(**tokenizer("まさにオールマイティーな商品だ。", return_tensors="pt")).last_hidden_state

Pre-trained BERT models and tokenizers are coming soon!
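Until those are released, the model name above is a placeholder. Once a tokenizer loads, you can also inspect the subword tokens on their own; a minimal sketch, assuming BertSudachipyTokenizer follows the standard transformers tokenizer interface (tokenize returns a list of string tokens):

>>> tokenizer.tokenize("まさにオールマイティーな商品だ。")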

Installation

$ pip install sudachitra

SudachiDict-core is installed as the default Sudachi dictionary. To use another dictionary, such as SudachiDict-small or SudachiDict-full, install it separately:

$ pip install sudachidict_small sudachidict_full
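Which installed dictionary and split mode are actually used is a SudachiPy-level choice. As a minimal sketch at the SudachiPy level (not chiTra's own API), assuming SudachiPy's documented dict_type parameter and SplitMode enum:

>>> from sudachipy import dictionary, tokenizer
>>> tokenizer_obj = dictionary.Dictionary(dict_type="small").create()  # use SudachiDict-small
>>> mode = tokenizer.Tokenizer.SplitMode.C  # coarsest (longest-unit) segmentation
>>> [m.surface() for m in tokenizer_obj.tokenize("まさにオールマイティーな商品だ。", mode)]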

Pretraining

Please refer to pretraining/bert/README.md.

Roadmap

  • Releasing pre-trained models for BERT
  • Adding tests
  • Updating documentation

For Developers

TBD

Contact

Sudachi and SudachiTra are developed by WAP Tokushima Laboratory of AI and NLP.

Open an issue, or come to our Slack workspace for questions and discussion.

https://sudachi-dev.slack.com/ (get an invitation here)

Enjoy tokenization!