chiTra is a Japanese tokenizer for Transformers; the name stands for "Sudachi for Transformers".
```python
>>> from transformers import BertModel
>>> from sudachitra import BertSudachipyTokenizer

>>> tokenizer = BertSudachipyTokenizer.from_pretrained('sudachitra-bert-base-japanese-sudachi')
>>> model = BertModel.from_pretrained('sudachitra-bert-base-japanese-sudachi')
>>> model(**tokenizer("まさにオールマイティーな商品だ。", return_tensors="pt")).last_hidden_state
```
Pre-trained BERT models and tokenizers are coming soon!
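In the meantime, the tokenizer follows the standard Hugging Face tokenizer interface, so its output can be inspected in the usual way. A minimal sketch, reusing the not-yet-published checkpoint name from the quickstart above:

```python
from sudachitra import BertSudachipyTokenizer

# The checkpoint name is the placeholder from the quickstart; it is not
# released yet, so this will only run once the pre-trained files are available.
tokenizer = BertSudachipyTokenizer.from_pretrained('sudachitra-bert-base-japanese-sudachi')

encoding = tokenizer("まさにオールマイティーな商品だ。", return_tensors="pt")
print(encoding["input_ids"].shape)  # torch.Size([1, sequence_length])

# Map the token ids back to the Sudachi-based subword tokens.
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist()))
```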
You can install chiTra via pip:

```sh
$ pip install sudachitra
```
The default Sudachi dictionary is SudachiDict-core. You can also use other dictionaries, such as SudachiDict-small and SudachiDict-full, but you need to install them separately:
```sh
$ pip install sudachidict_small sudachidict_full
```
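After installing an alternative dictionary, you can check that it loads by tokenizing with SudachiPy directly. This is a minimal sketch and not part of chiTra itself; note that the `dict` keyword argument follows SudachiPy 0.6+ (older releases used `dict_type`):

```python
from sudachipy import Dictionary

# Load the small dictionary explicitly; "core" is the default and "full" also works.
sudachi = Dictionary(dict="small").create()
print([m.surface() for m in sudachi.tokenize("まさにオールマイティーな商品だ。")])
```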
For details on pretraining, please refer to pretraining/bert/README.md.
- Releasing pre-trained models for BERT
- Adding tests
- Updating documentation
TBD
Sudachi and SudachiTra are developed by WAP Tokushima Laboratory of AI and NLP.
Open an issue, or come to our Slack workspace for questions and discussion.
https://sudachi-dev.slack.com/ (Get an invitation here)
Enjoy tokenization!