NICT BERT LOADER

Loads NICT BERT in Hugging Face Transformers style.
The returned tokenizer can be applied to raw text, so you do not need to pre-tokenize it with MeCab yourself.
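
For context, the manual step this loader is meant to spare you can be sketched as below. This is a rough illustration, not the loader's actual code: the function name pre_tokenize is hypothetical, and the assumption is that the text is converted to full-width with mojimoji and then segmented by MeCab with the Juman dictionary (the two packages listed in the requirements).

```python
def pre_tokenize(text: str, jumandic_path: str, mecabrc_path: str = "/etc/mecabrc") -> str:
    """Hypothetical helper: return `text` as space-separated MeCab-Juman tokens."""
    # Imported lazily so this sketch can be read without both packages installed.
    import MeCab
    import mojimoji

    # -Owakati makes MeCab print tokens separated by spaces.
    # A real implementation would build the Tagger once and reuse it.
    tagger = MeCab.Tagger(f"-Owakati -r {mecabrc_path} -d {jumandic_path}")

    # Assumption: half-width characters are converted to full-width first.
    normalized = mojimoji.han_to_zen(text)
    return tagger.parse(normalized).strip()
```

With the loader, this step happens inside the tokenizer, so the same raw strings can be passed directly.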

[NOTE] This loader overwrites a member variable of the BertJapaneseTokenizer class (BertJapaneseTokenizer.word_tokenizer). Use it only for personal purposes such as simplifying experiment code, not for model distribution: if you distribute a trained model, its users will also need to use this loader.

requirements / dev-env

commandline
- wget
- mecab
- mecab-jumandic (see instruction below)
python
- transformers==4.9.0
- mecab-python3==1.0.4
- mojimoji==0.0.11

installing mecab-jumandic

Installing mecab-jumandic via apt causes an error, so install it manually.
- You can use script/install_jumandic.sh:

$ bash install_jumandic.sh

how to use

1. Copy the nict_bert_loader directory into your working directory.

cd /path/to/this/repo
cp -r ./nict_bert_loader /path/to/working/directory/

2. Import the load_nict_bert function and use it.

from nict_bert_loader import load_nict_bert

tokenizer, model = load_nict_bert("32K_BPE")

texts = [
    "NICT版のBERTを事前形態素分割無しで利用することができます。",
    "呼び出し部分だけはif文で処理する必要がありますが", 
    "Transformersにある他の日本語版BERTと同じ学習コードで利用できます。",
    "ただし、学習済みモデル配布目的の場合は注意が必要です。"
]

tokenized = tokenizer(
    texts, padding=True, truncation=True, return_tensors="pt"
)

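# With the default AutoModel task and return_dict=False, the model returns a tuple:
# per-token hidden states and the pooled [CLS] output.
hs, cls = model(**tokenized, return_dict=False)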
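
As the sample sentences say, only the loading step needs an if statement; the rest of the training code can be shared with other Japanese BERT models in Transformers. A minimal sketch of such a branch follows. The wrapper name load_japanese_bert and the NICT_MODEL_TYPES set are hypothetical and not part of this repo; the non-NICT branch uses the standard AutoTokenizer and AutoModel from_pretrained calls.

```python
# Model types accepted by load_nict_bert (see "args of load_nict_bert").
NICT_MODEL_TYPES = {"32K_BPE", "100K"}


def load_japanese_bert(name: str):
    """Hypothetical wrapper: return (tokenizer, model) for a NICT or Hub model."""
    if name in NICT_MODEL_TYPES:
        # NICT BERT goes through this repo's loader.
        from nict_bert_loader import load_nict_bert
        return load_nict_bert(name)

    # Any other name is treated as a regular Transformers model identifier.
    from transformers import AutoModel, AutoTokenizer
    return AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)
```

Everything after this call (tokenizing, batching, the forward pass) can then stay the same for both kinds of model.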

args of load_nict_bert

model_type [str]: model to load, either 32K_BPE or 100K.
task [class]: Transformers model class for the task, e.g. BertForQuestionAnswering. Default is AutoModel.
config_file [str]: name of the config file. Default is config.json.
weight_file [str]: name of the weight file. Default is pytorch_model.bin.
jumandic_path [str]: path to jumandic. Default is None (auto-detected via mecab-config --dicdir).
mecabrc_path [str]: path to mecabrc. Default is /etc/mecabrc.
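
If auto-detection fails, it may help to check what mecab-config --dicdir reports on your machine. The sketch below is one way to do that, not the loader's own logic: the function name guess_jumandic_path is hypothetical, and it assumes the dictionary lives in a subdirectory named jumandic under the reported directory.

```python
import os
import shutil
import subprocess
from typing import Optional


def guess_jumandic_path(dicdir: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: return <dicdir>/jumandic if it exists, else None."""
    if dicdir is None:
        # Ask MeCab where its dictionaries are installed.
        if shutil.which("mecab-config") is None:
            return None
        result = subprocess.run(
            ["mecab-config", "--dicdir"],
            capture_output=True, text=True, check=True,
        )
        dicdir = result.stdout.strip()

    # Assumption: the Juman dictionary sits in a "jumandic" subdirectory.
    candidate = os.path.join(dicdir, "jumandic")
    return candidate if os.path.isdir(candidate) else None
```

If this returns None, pass the correct location explicitly through jumandic_path.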

t-gappy/nict_bert_loader
