A TensorFlow solution to the named entity recognition (NER) task using Google AI's pre-trained BERT model.
BERT is an NLP model developed by Google for pre-training language representations. It leverages the enormous amount of plain text publicly available on the web and is trained in an unsupervised manner. Pre-training a BERT model is fairly expensive, but it is a one-time procedure for each language. Fortunately, Google has released several pre-trained models that you can download.
Using BERT has two stages: Pre-training and fine-tuning.
Pre-training is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a one-time procedure for each language. Google released a number of pre-trained models from the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding which were pre-trained at Google.
Fine-tuning is inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. SQuAD, for example, can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%, which is the single system state-of-the-art.
The other important aspect of BERT is that it can be adapted to many types of NLP tasks very easily. In the paper mentioned above, they demonstrate state-of-the-art results on sentence-level (e.g., SST-2), sentence-pair-level (e.g., MultiNLI), word-level (e.g., NER), and span-level (e.g., SQuAD) tasks with almost no task-specific modifications.
Download the pre-trained model for Chinese, then unzip it into this path: `/tmp/DI4Text/diModules/ChineseNER/chinese_L-12_H-768_A-12/`
Put the train/dev/test datasets in this path: `/tmp/DI4Text/diModules/ChineseNER/NERdata/`. The dataset files should be named `train.txt`, `dev.txt`, and `test.txt`. The dataset format is described later.
In the file `/tmp/DI4Text/diModules/ChineseNER/config.ini`, `label_list` specifies the named entity labels. Modify the label list based on your dataset.
💡Note: The labels should be separated by `,`. For example, if there are three labels `PER`, `ORG`, and `LOC` in the training dataset, then `label_list` should be `PER,ORG,LOC`.
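As an illustration of how a comma-separated `label_list` can be read, here is a minimal sketch using Python's standard `configparser`. The section name `ner` and the helper `read_label_list` are assumptions made for this example; only the `label_list` key and its comma-separated format come from the description above.

```python
import configparser

# Hypothetical config.ini content. The section name "ner" is an assumption;
# only the `label_list` key and its format are taken from the text above.
SAMPLE_CONFIG = """
[ner]
label_list = PER,ORG,LOC
"""


def read_label_list(config_text, section="ner"):
    """Return the entity labels from a comma-separated `label_list` entry."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser.get(section, "label_list")
    # Split on commas and drop stray whitespace or empty items.
    return [label.strip() for label in raw.split(",") if label.strip()]


labels = read_label_list(SAMPLE_CONFIG)
print(labels)
```

Stripping whitespace makes the sketch tolerant of entries such as `PER, ORG, LOC`, although the project itself may be stricter.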
`training.py` is the entry point for training. Run:
python training.py
The fine-tuned model will be stored at `/tmp/DI4Text/diModules/ChineseNER/output/`.
If you look into `/tmp/DI4Text/diModules/ChineseNER/output/`, it contains something like:
checkpoint 128
entity_level_predicted_result.txt 1.0K
eval.tf_record 219K
events.out.tfevents.1545202214 6.1M
graph.pbtxt 9.0M
label_list.pkl 1K
label_test.txt 2374K
label2id.pkl 1K
model.ckpt-0.data-00000-of-00001 1.3G
model.ckpt-0.index 23K
model.ckpt-0.meta 3.9M
model.ckpt-500.data-00000-of-00001 1.3G
model.ckpt-500.index 23K
model.ckpt-500.meta 3.9M
predict.tf_record 3340K
token_test.txt 1208K
train.tf_record 2.0M
(You may get `model.ckpt-123.data-00000-of-00001` or `model.ckpt-9876.data-00000-of-00001` instead, depending on the total number of training steps.) Now we have collected all three pieces of information that are needed for serving this fine-tuned model:
- The pre-trained model is downloaded to `/tmp/DI4Text/diModules/ChineseNER/chinese_L-12_H-768_A-12/`;
- Our fine-tuned model is stored at `/tmp/DI4Text/diModules/ChineseNER/output/`;
- Our fine-tuned model checkpoint is named `model.ckpt-500`.
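Because the checkpoint name depends on the number of training steps, it can help to derive it from the directory listing rather than hard-coding it. The following is a minimal sketch using only the standard library; the helper name `latest_checkpoint_name` is hypothetical, and the file names are taken from the listing above.

```python
import re


def latest_checkpoint_name(file_names):
    """Return the checkpoint prefix with the highest step, e.g. 'model.ckpt-500'."""
    pattern = re.compile(r"^(model\.ckpt-(\d+))\.index$")
    best_step, best_name = -1, None
    for name in file_names:
        match = pattern.match(name)
        if match and int(match.group(2)) > best_step:
            best_step, best_name = int(match.group(2)), match.group(1)
    return best_name


# File names copied from the example output directory listing.
listing = [
    "checkpoint",
    "model.ckpt-0.data-00000-of-00001",
    "model.ckpt-0.index",
    "model.ckpt-0.meta",
    "model.ckpt-500.data-00000-of-00001",
    "model.ckpt-500.index",
    "model.ckpt-500.meta",
]
print(latest_checkpoint_name(listing))
```

In practice the list would come from `os.listdir` on the output directory. TensorFlow also provides `tf.train.latest_checkpoint`, which reads the `checkpoint` file to answer the same question.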
The prediction result on the test dataset will be evaluated using `/tmp/DI4Text/diModules/ChineseNER/colleval.py`, and the evaluation result will be stored in `/tmp/DI4Text/diModules/ChineseNER/output/entity_level_predicted_result.txt`.
This is the result on the SIGHAN Bakeoff 2006 dataset for the NER task. The reported measures are precision, recall, and FB1:
processed 214490 tokens with 7450 phrases; found: 7418 phrases; correct: 6737.
accuracy: 99.10%; precision: 90.82%; recall: 90.43%; FB1: 90.62
LOC: precision: 92.15%; recall: 90.88%; FB1: 91.51 3416
ORG: precision: 84.26%; recall: 85.27%; FB1: 84.76 2192
PER: precision: 96.24%; recall: 95.71%; FB1: 95.98 1810
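The overall precision, recall, and FB1 above follow directly from the three phrase counts on the first line of the report. As a small worked example (the function name is hypothetical, and this is not the code of `colleval.py`):

```python
def precision_recall_f1(num_gold, num_found, num_correct):
    """Compute entity-level precision, recall and F1 from phrase counts."""
    precision = num_correct / num_found if num_found else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts from the report: 7450 gold phrases, 7418 found, 6737 correct.
p, r, f = precision_recall_f1(num_gold=7450, num_found=7418, num_correct=6737)
print(f"precision: {p:.2%}; recall: {r:.2%}; FB1: {100 * f:.2f}")
# prints "precision: 90.82%; recall: 90.43%; FB1: 90.62"
```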
💡Note: The labels should be in the format `X-LABEL`, e.g. `B-LOC`, `I-GPEC`. `colleval.py` cannot deal with other label formats.
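To make the `X-LABEL` convention concrete, the sketch below splits each label into its prefix and entity type and groups consecutive `B-`/`I-` tokens into entities. It is a simplified illustration under the assumption that `B` marks the beginning of an entity, `I` its continuation, and `O` a token outside any entity; it is not the logic used by `colleval.py`, and the function names are hypothetical.

```python
def split_label(label):
    """Split 'B-LOC' into ('B', 'LOC'); the outside label 'O' has no type."""
    if label == "O":
        return "O", None
    prefix, _, entity_type = label.partition("-")
    return prefix, entity_type


def extract_entities(tokens, labels):
    """Group B-/I- labelled tokens into (text, type) entities."""
    entities, current, current_type = [], [], None
    for token, label in zip(tokens, labels):
        prefix, entity_type = split_label(label)
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # Start a new entity (a stray I- tag is treated as a beginning).
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [token], entity_type
        elif prefix == "I":
            current.append(token)
        else:
            # Any other label closes the entity in progress.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return entities


tokens = list("在厦门与金门之间")
labels = ["O", "B-LOC", "I-LOC", "O", "B-LOC", "I-LOC", "O", "O"]
print(extract_entities(tokens, labels))
```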
`main.py` is the entry point for prediction. Run:
python main.py -m /path/to/output -o /path/to/result.xlsx /path/to/input.txt
- `/path/to/output`: Specify the path to the directory containing the model files, for instance `/tmp/DI4Text/diModules/ChineseNER/output/`.
- `/path/to/result.xlsx`: Specify the path to the Excel file which will contain the analysis results. Note: If the file already exists and cannot be overwritten, a temporary file will be created instead. Check the console log to find which file was actually written.
- `/path/to/input.txt`: Specify the path to the file containing the text you want predictions for.
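A command line of this shape can be parsed with Python's standard `argparse`. The sketch below only mirrors the `-m`, `-o`, and positional arguments shown above; the destination names, the help texts, and the choice to mark both options as required are assumptions, and the real `main.py` may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring `python main.py -m ... -o ... input.txt`."""
    parser = argparse.ArgumentParser(description="Run NER prediction on a text file.")
    # Destination names below are hypothetical; only the flags come from the text.
    parser.add_argument("-m", dest="model_dir", required=True,
                        help="directory containing the fine-tuned model files")
    parser.add_argument("-o", dest="output_file", required=True,
                        help="Excel file that will receive the analysis results")
    parser.add_argument("input_file",
                        help="text file containing the text to run prediction on")
    return parser


args = build_parser().parse_args(
    ["-m", "/path/to/output", "-o", "/path/to/result.xlsx", "/path/to/input.txt"]
)
print(args.model_dir, args.output_file, args.input_file)
```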
Train/dev/test dataset should be in this format:
海 O
钓 O
比 O
赛 O
地 O
点 O
在 O
厦 B-LOC
门 I-LOC
与 O
金 B-LOC
门 I-LOC
之 O
间 O
的 O
海 O
域 O
。 O
- Each line contains two fields: the first is the token (a single character), and the second is the label of that token. The two fields are separated by a space.
- The sentences are separated by a blank line.
- The length of each sentence must not exceed `max_seq_length`. If a sentence is longer than this maximum, the part beyond the limit will be ignored.
- 💡Note: The labels should be in the format `X-LABEL`, e.g. `B-LOC`, `I-GPEC`. Other label formats can be used for training, testing, and prediction, but the evaluation will not be processed properly.
- 💡Note: The first line must not be blank, and there should be exactly one blank line after the last sentence.
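The format rules above can be sketched as a small parser that reads token/label pairs, splits sentences on blank lines, and drops anything beyond `max_seq_length`. This is an illustrative reading of the format, not the project's loader: the function name is hypothetical, and the truncation here simply cuts at `max_seq_length`, whereas BERT-based code usually also reserves positions for special tokens.

```python
def parse_dataset(text, max_seq_length=128):
    """Parse 'token label' lines into a list of (tokens, labels) sentences."""
    sentences, tokens, labels = [], [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            # A blank line ends the current sentence.
            if tokens:
                sentences.append((tokens[:max_seq_length], labels[:max_seq_length]))
                tokens, labels = [], []
            continue
        if len(parts) != 2:
            raise ValueError(f"expected 'token label', got: {line!r}")
        tokens.append(parts[0])
        labels.append(parts[1])
    # Keep a final sentence even if the trailing blank line is missing.
    if tokens:
        sentences.append((tokens[:max_seq_length], labels[:max_seq_length]))
    return sentences


sample = "厦 B-LOC\n门 I-LOC\n。 O\n\n金 B-LOC\n门 I-LOC\n\n"
for sentence_tokens, sentence_labels in parse_dataset(sample):
    print(sentence_tokens, sentence_labels)
```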
Hyperparameters are stored in `/tmp/DI4Text/diModules/ChineseNER/config.ini`.
Argument | Type | Default | Description |
---|---|---|---|
`task_name` | str | ner | The name of the task to train. |
`do_lower_case` | bool | True | Whether to lowercase the input text. |
`max_seq_length` | int | 128 | The maximum total input sequence length. |
`clean` | bool | True | Whether to remove the files created by the last training run. |
`do_train` | bool | True | Whether to run training. |
`do_eval` | bool | True | Whether to run eval on the dev set. |
`do_predict` | bool | True | Whether to run the model in inference mode on the test set. |
`batch_size` | int | 128 | Total batch size. |
`train_batch_size` | int | 128 | Total batch size for training. |
`eval_batch_size` | int | 64 | Total batch size for eval. |
`predict_batch_size` | int | 64 | Total batch size for predict. |
`learning_rate` | float | 8e-6 | The initial learning rate for Adam. |
`num_train_epochs` | int | 2 | Total number of training epochs to perform. |
`dropout_rate` | float | 0.5 | Dropout rate. |
`clip` | float | 5 | Gradient clip. |
`warmup_proportion` | float | 0.1 | Proportion of training to perform linear learning rate warmup for. E.g., 0.1 = 10% of training. |
`save_summary_steps` | int | 500 | How often to save the model summary. |
`save_checkpoints_steps` | int | 500 | How often to save the model checkpoint. |
`iterations_per_loop` | int | 1000 | How many steps to make in each estimator call. |
`cell` | list | lstm | Which RNN cell to use; valid values are `lstm` and `gru`. |
`lstm_size` | int | 128 | Size of the LSTM units. |
`num_layers` | int | 2 | Number of RNN layers. |
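To show how `train_batch_size`, `num_train_epochs`, `warmup_proportion`, and `learning_rate` interact, here is a minimal sketch of a linear warmup followed by a linear decay to zero. It assumes this project follows the schedule used in the original BERT code, which is not confirmed here; the function names and the example count of 10000 training sentences are hypothetical.

```python
def training_schedule(num_examples, train_batch_size=128,
                      num_train_epochs=2, warmup_proportion=0.1):
    """Return (total training steps, warmup steps) for the given settings."""
    num_train_steps = int(num_examples / train_batch_size * num_train_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


def learning_rate_at(step, num_train_steps, num_warmup_steps, base_lr=8e-6):
    """Linear warmup to base_lr, then linear decay to zero (assumed schedule)."""
    if step < num_warmup_steps:
        return base_lr * step / num_warmup_steps
    remaining = max(num_train_steps - step, 0)
    return base_lr * remaining / num_train_steps


# 10000 is a made-up number of training sentences, used only for illustration.
total_steps, warmup_steps = training_schedule(num_examples=10000)
print(total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps):
    print(step, learning_rate_at(step, total_steps, warmup_steps))
```

With the defaults from the table, a larger `warmup_proportion` lengthens the ramp-up phase, while `num_train_epochs` and `train_batch_size` together determine how many steps the decay is spread over.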