bert-as-language-model

BERT as a language model; a fork of the original repository.


BERT (Japanese Pretrained Model) as Language Model

Overview

The original code outputs the likelihood of each token (word) in an input sentence using the original BERT model. This fork uses the BERT Japanese pretrained model so that token likelihoods can also be computed for Japanese sentences.

Environment

  • Ubuntu 16.04.4 LTS
  • Python 3.6.0
    • tensorflow==1.14.0
    • pyknp==0.4.1
  • Juman++ Version: 2.0.0-rc2

References

Changes

Test Results

  • Command
export BERT_BASE_DIR=models/Japanese_L-12_H-768_A-12_E-30_BPE
python run_lm_predict.py \
  --input_file=./data/lm/test.ja.tsv \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --do_lower_case=False \
  --max_seq_length=128 \
  --output_dir=./data/lm/output/ \
  --jp_tokenizer=True
  • Input
cat ./data/lm/test.ja.tsv 
機械学習で処理する
機会学習で処理する
世間では人工知能が流行している
世間では人口知能が流行している
  • Results
# prob: probability
# ppl:  perplexity
[
  {
    "tokens": [
      {
        "token": "機械",
        "prob": 0.7197790145874023
      },
      {
        "token": "学習",
        "prob": 0.0011436253553256392
      },
      {
        "token": "",
        "prob": 0.4158846437931061
      },
      {
        "token": "処理",
        "prob": 0.00014628272037953138
      },
      {
        "token": "する",
        "prob": 0.00011202425957890227
      }
    ],
    "ppl": 177.91305425305157
  },
  {
    "tokens": [
      {
        "token": "機会",
        "prob": 4.273203558113892e-06  # low probability
      },
      {
        "token": "学習",
        "prob": 0.00048818811774253845
      },
      {
        "token": "",
        "prob": 0.14289069175720215
      },
      {
        "token": "処理",
        "prob": 0.0002504551666788757
      },
      {
        "token": "する",
        "prob": 8.453470945823938e-05
      }
    ],
    "ppl": 2754.0894828429246
  },
  {
    "tokens": [
      {
        "token": "世間",
        "prob": 0.00029105637804605067
      },
      {
        "token": "",
        "prob": 0.8351001739501953
      },
      {
        "token": "",
        "prob": 0.9587864875793457
      },
      {
        "token": "人工",
        "prob": 0.986724317073822
      },
      {
        "token": "知能",
        "prob": 0.5667852759361267
      },
      {
        "token": "",
        "prob": 0.9329909086227417
      },
      {
        "token": "流行",
        "prob": 0.1348113864660263
      },
      {
        "token": "して",
        "prob": 0.9265641570091248
      },
      {
        "token": "いる",
        "prob": 1.2146945664426312e-05
      }
    ],
    "ppl": 12.065788660694167
  },
  {
    "tokens": [
      {
        "token": "世間",
        "prob": 0.00021123532496858388
      },
      {
        "token": "",
        "prob": 0.8677905201911926
      },
      {
        "token": "",
        "prob": 0.949567437171936
      },
      {
        "token": "人口",
        "prob": 1.7458520233049057e-05  # low probability
      },
      {
        "token": "知能",
        "prob": 3.086066135438159e-05  # low probability
      },
      {
        "token": "",
        "prob": 0.914918839931488
      },
      {
        "token": "流行",
        "prob": 0.051112908869981766
      },
      {
        "token": "して",
        "prob": 0.927436888217926
      },
      {
        "token": "いる",
        "prob": 1.0879161891352851e-05
      }
    ],
    "ppl": 141.40153489831755
  }
]
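
The `ppl` field in the output is the exponentiated negative mean of the token log-probabilities. This can be checked directly against the `prob` values above; the function below is a quick reimplementation for illustration, not code taken from the repository:

```python
import math

def perplexity(probs):
    """ppl = exp(-(1/k) * sum_i log p_i) over the k token probabilities."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Token probabilities for 機械学習で処理する from the output above:
probs = [0.7197790145874023, 0.0011436253553256392, 0.4158846437931061,
         0.00014628272037953138, 0.00011202425957890227]
print(perplexity(probs))  # ≈ 177.913, matching the reported ppl
```

Note how a single very unlikely token (e.g. 機会 at prob ≈ 4.3e-06 in the second sentence) is enough to blow the perplexity up from ~178 to ~2754.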

BERT as Language Model (Original README)

For a sentence S = w_1, w_2,..., w_k , we have

p(S) = \prod_{i=1}^{k} p(w_i | context)

In a traditional language model, such as an RNN, context = w_1, ..., w_{i-1}, so

p(S) = \prod_{i=1}^{k} p(w_i | w_1, ..., w_{i-1})

A bidirectional language model has a larger context: context = w_1, ..., w_{i-1}, w_{i+1}, ..., w_k.

In this implementation, we simply adopt the following approximation,

p(S) \approx \prod_{i=1}^{k} p(w_i | w_1, ..., w_{i-1},w_{i+1}, ...,w_k).
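
The approximation can be sketched in a few lines: mask each position in turn, ask the model for the probability of the original token at that position, and sum the log-probabilities. Here `score_masked` is a hypothetical stand-in for a real BERT masked-LM forward pass; only the masking loop reflects the scheme above:

```python
import math

def pseudo_log_likelihood(tokens, score_masked, mask_token="[MASK]"):
    """Approximate log p(S) = sum_i log p(w_i | w_1..w_{i-1}, w_{i+1}..w_k)
    by masking one position at a time and scoring the original token there.
    `score_masked(masked_tokens, i, original)` is assumed to return
    p(original | context), e.g. from a BERT masked-LM head."""
    total = 0.0
    for i, original in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        total += math.log(score_masked(masked, i, original))
    return total

# Toy scorer: pretend every token has probability 0.5 in any context,
# so log p(S) for a 4-token sentence is 4 * log(0.5).
toy = lambda masked, i, original: 0.5
print(pseudo_log_likelihood(["there", "is", "a", "book"], toy))
```

This requires k forward passes for a k-token sentence, one per masked position, which is why scoring is noticeably slower than a single unidirectional pass.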

test-case

more cases: Chinese (中文)

export BERT_BASE_DIR=model/uncased_L-12_H-768_A-12
export INPUT_FILE=data/lm/test.en.tsv
python run_lm_predict.py \
  --input_file=$INPUT_FILE \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --output_dir=/tmp/lm_output/

for the following test case

$ cat data/lm/test.en.tsv 
there is a book on the desk
there is a plane on the desk
there is a book in the desk

$ cat /tmp/lm_output/test_result.json

output:

# prob: probability
# ppl:  perplexity
[
  {
    "tokens": [
      {
        "token": "there",
        "prob": 0.9988962411880493
      },
      {
        "token": "is",
        "prob": 0.013578361831605434
      },
      {
        "token": "a",
        "prob": 0.9420605897903442
      },
      {
        "token": "book",
        "prob": 0.07452250272035599
      },
      {
        "token": "on",
        "prob": 0.9607976675033569
      },
      {
        "token": "the",
        "prob": 0.4983428418636322
      },
      {
        "token": "desk",
        "prob": 4.040586190967588e-06
      }
    ],
    "ppl": 17.69329728285426
  },
  {
    "tokens": [
      {
        "token": "there",
        "prob": 0.996775209903717
      },
      {
        "token": "is",
        "prob": 0.03194097802042961
      },
      {
        "token": "a",
        "prob": 0.8877727389335632
      },
      {
        "token": "plane",
        "prob": 3.4907534427475184e-05   # low probability
      },
      {
        "token": "on",
        "prob": 0.1902322769165039
      },
      {
        "token": "the",
        "prob": 0.5981084704399109
      },
      {
        "token": "desk",
        "prob": 3.3164762953674654e-06
      }
    ],
    "ppl": 59.646456254851806
  },
  {
    "tokens": [
      {
        "token": "there",
        "prob": 0.9969795942306519
      },
      {
        "token": "is",
        "prob": 0.03379646688699722
      },
      {
        "token": "a",
        "prob": 0.9095568060874939
      },
      {
        "token": "book",
        "prob": 0.013939591124653816
      },
      {
        "token": "in",
        "prob": 0.000823647016659379  # low probability
      },
      {
        "token": "the",
        "prob": 0.5844194293022156
      },
      {
        "token": "desk",
        "prob": 3.3361218356731115e-06
      }
    ],
    "ppl": 54.65941516205144
  }
]