Chinese Named Entity Recognition

Tensorflow solution of named entity recognition task using Google AI's pre-trained BERT model.

What is BERT

BERT is a NLP model developed by Google for pre-training language representations. It leverages an enormous amount of plain text data publicly available on the web and is trained in an unsupervised manner. Pre-training a BERT model is a fairly expensive yet one-time procedure for each language. Fortunately, Google released several pre-trained models where you can download from here.

Using BERT has two stages: Pre-training and fine-tuning.

Pre-training is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a one-time procedure for each language. Google released a number of pre-trained models from the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding which were pre-trained at Google.

Fine-tuning is inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. SQuAD, for example, can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%, which is the single system state-of-the-art.

The other important aspect of BERT is that it can be adapted to many types of NLP tasks very easily. In the paper mentioned above, they demonstrate state-of-the-art results on sentence-level (e.g., SST-2), sentence-pair-level (e.g., MultiNLI), word-level (e.g., NER), and span-level (e.g., SQuAD) tasks with almost no task-specific modifications.

How to train

1. Download a Pre-trained BERT Model

Download the pretrained model for Chinese here, then uncompress the zip file into this path:/tmp/DI4Text/diModules/ChineseNER/chinese_L-12_H-768_A-12/

2. Add Dataset

Put train/dev/test dataset in this path: /tmp/DI4Text/diModules/ChineseNER/NERdata/. The file name of the train/dev/test dataset should be train.txt, dev.txt, test.txt. The format of dataset will be introduced later.

3. Add Labels

In the file /tmp/DI4Text/diModules/ChineseNER/config.ini, label_list specifies the named entity labels. Modify the label list based on your dataset.

💡Note: The labels should be splitted by ','.For example, there are three labels PER, ORG, LOC in the training dataset, then label_list should be PER,ORG,LOC.

4. Train model

training.py is the entrypoint for training. Run

python training.py

The fine-tuned model will be stored at /tmp/DI4Text/diModules/ChineseNER/output/.

If you look into /tmp/DI4Text/diModules/ChineseNER/output/, it contains something like:

checkpoint                                        128
entity_level_predicted_result.txt                 1.0K
eval.tf_record                                    219K
events.out.tfevents.1545202214                    6.1M
graph.pbtxt                                       9.0M
label_list.pkl                                    1K
label_test.txt                                    2374K
label2id.pkl                                      1K
model.ckpt-0.data-00000-of-00001                  1.3G
model.ckpt-0.index                                23K
model.ckpt-0.meta                                 3.9M
model.ckpt-500.data-00000-of-00001                1.3G
model.ckpt-500.index                              23K
model.ckpt-500.meta                               3.9M
predict.tf_record                                 3340K
token_test.txt                                    1208K
train.tf_record                                   2.0M

One may get model.ckpt-123.data-00000-of-00001 or model.ckpt-9876.data-00000-of-00001 depending on the total training steps). Now we have collected all three pieces of information that are needed for serving this fine-tuned model:

The pretrained model is downloaded to /tmp/DI4Text/diModules/ChineseNER/chinese_L-12_H-768_A-12/
Our fine-tuned model is stored at /tmp/DI4Text/diModules/ChineseNER/output/;
Our fine-tuned model checkpoint is named as model.ckpt-500.

5. Evaluate the model

The prediction result on the test dataset will be evaluated using /tmp/DI4Text/diModules/ChineseNER/colleval.py, and the evaluation result will be stored in /tmp/DI4Text/diModules/ChineseNER/output/entity_level_predicted_result.txt.

This is the result on the dataset SIGHAN Bakeoff 2006 for NER task. The measures that will be reported are precision, recall, and FB1. :

processed 214490 tokens with 7450 phrases; found: 7418 phrases; correct: 6737.
accuracy:  99.10%; precision:  90.82%; recall:  90.43%; FB1:  90.62
              LOC: precision:  92.15%; recall:  90.88%; FB1:  91.51  3416
              ORG: precision:  84.26%; recall:  85.27%; FB1:  84.76  2192
              PER: precision:  96.24%; recall:  95.71%; FB1:  95.98  1810

💡Note: The labels should be in the format X-LABEL, e.g. B-LOC, I-GPEC. colleval.py cannot deal with other label formats.

How to predict

main.py is the entrypoint for the prediction. Run

python main.py -m /path/to/output -o /path/to/result.xlsx /path/to/input.txt

/path/to/output: Specify the path to the directory containing the model files. For instance /tmp/DI4Text/diModules/ChineseNER/output/
/path/to/result.xlsx: Specify the path to the Excel file which will contain the analysis results. Note: If the file already exists and cannot be overwritten, a temporary file will be created instead. Check the console log to find which file was actually written.
/path/to/input.txt: Specify the path to the file which contains the text you want to be predicted.

Dataset Format

Train/dev/test dataset should be in this format:

海 O
钓 O
比 O
赛 O
地 O
点 O
在 O
厦 B-LOC
门 I-LOC
与 O
金 B-LOC
门 I-LOC
之 O
间 O
的 O
海 O
域 O
。 O

There are two characters in each line: the first character is the token, the second is the label of the token. These two characters are separated by a space.
The sentences are separated by a blank line.
The length of each sentence could not exceed max_seq_length. If the length of one sentence is longer than the maximum value, the exceeded part of the sentence will be ignored.
💡Note: The labels should be in the format X-LABEL, e.g. B-LOC, I-GPEC. Other label format can be used to do the training, testing and prediction, but the evaluation cannot be processed properly.
💡Note: The first line could not be blank; there should be only one blank line after the last sentence.

Hyperparameters

Hyperparameters are stored in /tmp/DI4Text/diModules/ChineseNER/config.ini.

Argument	Type	Default	Description
`task_name`	str	ner	The name of the task to train.
`do_lower_case`	bool	True	Whether to lower case the input text.
`max_seq_length`	int	128	The maximum total input sequence length.
`clean`	bool	True	Remove the files which created by last training.
`do_train`	bool	True	Whether to run training.
`do_eval`	bool	True	Whether to run eval on the dev set.
`do_predict`	bool	True	Whether to run the model in inference mode on the test set.
`batch_size`	int	128	Total batch size.
`train_batch_size`	int	128	Total batch size for training.
`eval_batch_size`	int	64	Total batch size for eval.
`predict_batch_size`	int	64	Total batch size for predict.
`learning_rate`	float	8e-6	The initial learning rate for Adam.
`num_train_epochs`	int	2	Total number of training epochs to perform.
`dropout_rate`	float	0.5	Dropout rate
`clip`	float	5	Gradient clip
`warmup_proportion`	float	0.1	Proportion of training to perform linear learning rate warmup for. E.g., 0.1 = 10% of training.
`save_summary_steps`	int	500	How often to save the model summary.
`save_checkpoints_steps`	int	500	How often to save the model checkpoint.
`iterations_per_loop`	int	1000	How many steps to make in each estimator call.
`cell`	list	lstm	Which rnn cell used, valid values are `lstm` and `gru`.
`lstm_size`	int	128	Size of lstm units.
`num_layers`	int	2	Number of rnn layers.

Neural Network Structure

For a given token, its input respresentation is constructed by summing the corresponding token, sement, and position embeddings. The character-level embeddings are the output of BERT. These embeddings are used as the input of the corresponding slot in the bidirectional LSTM. In turn, the output of the Bi-LSTM slot is noted as hi and used as input for the CRF. Eventually, the CRF slot emits prediction of labels. Token “长” is the 1-th token in the sentence and is predicted as “B-LOC”.

JingYannn/ChineseNER