This repository implements an LSTM-CRF model for named entity recognition. The model is the same as the one in Lample et al. (2016), except that we do not have the last tanh layer after the BiLSTM.
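For intuition, the CRF decoding step on top of the BiLSTM scores can be sketched with a plain-Python Viterbi pass. This is a toy illustration only (the repo's CRF layer is a batched PyTorch implementation); the emission and transition numbers below are made up.

```python
# Minimal Viterbi decoding sketch for a linear-chain CRF (illustration only).
# emissions[t][y]: score of tag y at position t (from the BiLSTM in the real model)
# transitions[y1][y2]: score of moving from tag y1 to tag y2

def viterbi_decode(emissions, transitions):
    n_tags = len(emissions[0])
    # scores[y] = best score of any tag path ending in tag y so far
    scores = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_scores, bp = [], []
        for y in range(n_tags):
            best_prev = max(range(n_tags),
                            key=lambda p: scores[p] + transitions[p][y])
            bp.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][y] + emissions[t][y])
        scores = new_scores
        backpointers.append(bp)
    # follow backpointers from the best final tag
    best_last = max(range(n_tags), key=lambda y: scores[y])
    path = [best_last]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    return list(reversed(path))

emissions = [[2.0, 0.5], [0.5, 2.0], [2.0, 0.5]]   # 3 tokens, 2 tags
transitions = [[0.0, 0.0], [0.0, 0.0]]
print(viterbi_decode(emissions, transitions))  # [0, 1, 0]
```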
We achieve the SOTA performance on both CoNLL-2003 and OntoNotes 5.0 English datasets (check our benchmark).
Announcement: The integration with Transformers is now available. We are running benchmark experiments on different datasets; results are coming soon. Stay tuned.
- Python >= 3.6 and PyTorch >= 1.4.0 (tested)
- Transformers package from HuggingFace (required if using Transformers)
If you use conda:

```bash
git clone https://github.com/allanj/pytorch_lstmcrf.git
conda create -n pt_lstmcrf python=3.6
conda activate pt_lstmcrf
# check https://pytorch.org for the suitable version for your machine
conda install pytorch=1.4.0 torchvision cudatoolkit=10.0 -c pytorch -n pt_lstmcrf
pip install tqdm
pip install termcolor
pip install overrides
pip install allennlp
pip install transformers
```
In the documentation below, we present four ways for users to run the code:
- Run the model by fine-tuning BERT/RoBERTa/etc. from the Transformers package.
- Run the model with static BERT/RoBERTa/etc. representations from the Transformers package.
- Run the model with simple word embeddings.
- Run the model with static ELMo/BERT representations loaded from external vectors.

Our default argument setup refers to the second one.
- Simply replace the `embedder_type` argument with the model in HuggingFace. For example, if we are using `bert-base-cased`, we just need to change the embedder type to `bert-base-cased`:

  ```bash
  python trainer.py --device=cuda:0 --dataset=YourData --model_folder=saved_models --embedder_type=bert-base-cased
  ```
- (Optional) Using other models in HuggingFace:
  - Check whether your preferred language model is in `config/transformers_util.py`. If not, add it to the utils. For example, if you would like to use `BERT-Large`, add the following line to the dictionary:

    ```python
    'bert-large-cased': {"model": BertModel, "tokenizer": BertTokenizer}
    ```

    The name `bert-large-cased` has to follow the naming rule by HuggingFace.
  - Run the main file with the modified `embedder_type` argument:

    ```bash
    python trainer.py --embedder_type=bert-large-cased
    ```

    The default value for `embedder_type` is `normal`, which refers to the classic LSTM-CRF, and we can use `static_context_emb` as in the previous section. If we change the name to something like `bert-base-cased` or `bert-base-uncased`, we directly load the model from HuggingFace. Note: if you use other models, remember to replace the tokenization mechanism in `config/utils.py`.
- Finally, if you would like to know more about the details, read on:
  - Tokenization: For BERT, we use the first wordpiece to represent a complete word. Check `config/transformers_util.py`.
  - Embedder: We show how to embed the input tokens to make word representations. Check `model/embedder/transformers_embedder.py`.
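The first-wordpiece convention can be sketched as follows. This is a toy illustration: `toy_wordpieces` is a made-up stand-in for a real subword tokenizer such as `BertTokenizer`, and its splits are invented.

```python
# Toy sketch of "the first wordpiece represents the complete word".
# toy_wordpieces is a stand-in for a real subword tokenizer (the splits
# below are invented for illustration).

def toy_wordpieces(word):
    splits = {"playing": ["play", "##ing"], "unhappiness": ["un", "##happi", "##ness"]}
    return splits.get(word, [word])

def first_subword_indices(words):
    pieces, first_idx = [], []
    for word in words:
        first_idx.append(len(pieces))  # position of this word's first wordpiece
        pieces.extend(toy_wordpieces(word))
    return pieces, first_idx

pieces, idx = first_subword_indices(["she", "is", "playing"])
print(pieces)  # ['she', 'is', 'play', '##ing']
print(idx)     # [0, 1, 2]
```

In the real model, the representation of each word is then gathered from the subword outputs at these first-piece indices.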
To use static (frozen) contextualized representations, simply go to `model/transformers_embedder.py` and uncomment the following:

```python
self.model.requires_grad = False
```
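The intent of freezing can be illustrated with a pure-Python mock (the class names below are invented; with a real PyTorch module, the usual idiom is to set `requires_grad = False` on each tensor returned by `model.parameters()`, or to call `model.requires_grad_(False)`):

```python
# Pure-Python mock of parameter freezing (illustrative only).
# With a real PyTorch model the equivalent pattern is:
#     for p in model.parameters():
#         p.requires_grad = False

class MockParam:
    def __init__(self):
        self.requires_grad = True

class MockModel:
    def __init__(self, num_params):
        self._params = [MockParam() for _ in range(num_params)]
    def parameters(self):
        return list(self._params)

def freeze(model):
    for p in model.parameters():
        p.requires_grad = False  # excluded from gradient updates

model = MockModel(4)
freeze(model)
print(all(not p.requires_grad for p in model.parameters()))  # True
```

Frozen parameters are skipped by the optimizer, so the pretrained encoder then acts as a static feature extractor.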
Instructions for using word embeddings or external contextualized embeddings (ELMo/BERT) can be found here.
- Create a folder `YourData` under the data directory.
- Put the `train.txt`, `dev.txt` and `test.txt` files (make sure the format is compatible, i.e., the first column is words and the last column is tags) under this directory. If you have a different format, simply modify the reader in `config/reader.py`.
- Change the `dataset` argument to `YourData` when you run `trainer.py`.
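The expected file layout can be sketched with a minimal reader. This is a simplified stand-in for `config/reader.py`, assuming one token per line, the word in the first column, the tag in the last, and blank lines between sentences:

```python
# Minimal CoNLL-style reader sketch (simplified stand-in, not the repo's reader).
# Each non-blank line: word in the first column, tag in the last column;
# blank lines separate sentences.

def read_conll(lines):
    sentences, words, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:  # sentence boundary
            if words:
                sentences.append((words, tags))
                words, tags = [], []
            continue
        columns = line.split()
        words.append(columns[0])
        tags.append(columns[-1])
    if words:  # flush the last sentence
        sentences.append((words, tags))
    return sentences

sample = ["EU NNP B-ORG", "rejects VBZ O", "", "German JJ B-MISC", "call NN O"]
print(read_conll(sample))
# [(['EU', 'rejects'], ['B-ORG', 'O']), (['German', 'call'], ['B-MISC', 'O'])]
```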
We trained an English LSTM-CRF (+ELMo) model on the CoNLL-2003 dataset. You can directly predict a sentence with the following piece of code (note: we do not do tokenization). You can download the English model through this link.
```python
from ner_predictor import NERPredictor

sentence = "This is an English model ."
# Or you can pass a list of sentences:
# sentence = ["This is an English model", "This is the second sentence"]
model_path = "english_model.tar.gz"
predictor = NERPredictor(model_path, cuda_device="cpu")  # use "cuda:0", "cuda:1" for GPU
prediction = predictor.predict(sentence)
print(prediction)
```
- Benchmark performance
  - In our experience, using ELMo for NER is easier for tuning and obtains quite good performance compared to BERT, but we did not try other language models.
- Support for ELMo/BERT as features
- Interactive mode where we can just import the model and decode a sentence
- Make the code more modularized (separate the encoder and inference layers) and more readable (by adding more comments)
- Move the benchmark performance documentation to another markdown file
- Integrate BERT as a module instead of just features
- Clean up the code for better organization (e.g., `import` statements)
- Benchmark experiments for Transformers-based models
A huge thanks to @yuchenlin for his contributions to this repo.