blstm-crf-ner

An NER model (B-LSTM + CRF + word embeddings) implemented in TensorFlow. It tags noisy Turkish text (tweets in particular) without using any hand-crafted features or rules.

The model closely follows Lample et al., Gungor et al., and Ma and Hovy. Consequently, the source code is heavily influenced by Guillaume Genthial's sequence_tagging and Guillaume Lample's tagger projects.
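To illustrate what the CRF layer contributes on top of the B-LSTM, here is a minimal pure-Python sketch of Viterbi decoding for a linear-chain CRF. It is not taken from this repository: the function name viterbi_decode, the tag set, and the scores are all made up for the example, and the actual model presumably relies on TensorFlow's own CRF operations.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_path, best_score) for a single sentence.

    emissions[t][k]   : score of tag k at position t (e.g. from the B-LSTM)
    transitions[i][j] : score of moving from tag i to tag j (CRF parameters)
    """
    num_tags = len(transitions)
    # scores[k] = best score of any tag sequence ending in tag k so far
    scores = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_scores, pointers = [], []
        for j in range(num_tags):
            # Best previous tag when landing on tag j at position t
            best_i = max(range(num_tags),
                         key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j]
                              + emissions[t][j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Walk the backpointers from the best final tag to recover the path
    last = max(range(num_tags), key=lambda k: scores[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[last]


# Hypothetical tag set and scores, chosen only for illustration
tags = ["O", "B-PER", "I-PER"]
emissions = [[0.1, 2.0, 0.0],
             [0.5, 0.0, 1.0],
             [2.0, 0.0, 0.3]]
transitions = [[0.5, 0.0, -10.0],   # O -> I-PER is heavily penalized
               [0.0, -1.0, 1.0],
               [0.0, 0.0, 0.5]]

path, score = viterbi_decode(emissions, transitions)
decoded = [tags[k] for k in path]
print(decoded)
```

The transition scores are what let the CRF discourage invalid tag sequences (such as I-PER directly after O), which independent per-token predictions cannot do.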

Prerequisites

  • Python (3 or newer)
  • pip, virtualenv, make

Getting started

  1. Create an isolated environment with:
virtualenv -p /usr/bin/python3 virtual-env
source virtual-env/bin/activate
pip install -r requirements.txt

Hint: When you are done working, type deactivate to exit the virtual environment.

  2. Download the word2vec vectors with:
make word2vec

Alternatively, you can download them manually here and update the filename_word2vec entry in model/config.py. You can also skip loading pretrained word vectors by setting the use_pretrained entry to False in model/config.py.
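As a rough illustration, the two entries mentioned above might look like the excerpt below. Only the names filename_word2vec and use_pretrained come from this README; the path is a hypothetical placeholder, and whether the entries sit at module level or inside a configuration class depends on the actual file.

```python
# Excerpt in the style of model/config.py (values are placeholders, not project defaults)

# Set to False to train the word embeddings from scratch instead of
# loading pretrained word2vec vectors.
use_pretrained = True

# Hypothetical path: point this at the vectors you downloaded manually.
filename_word2vec = "data/word2vec/vectors.txt"
```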

  3. Build the training data, then train and evaluate the model with:
make run

Details

Here is the breakdown of the commands executed in make run:

  1. Build the vocabulary from the data and extract the trimmed word2vec vectors according to the config in model/config.py:
python build_data.py
  2. Train the model with:
python train.py
# Or redirect everything into a log file and detach the process by typing:
# python train.py >> out.log 2>&1 & disown
  3. Evaluate and interact with the model with:
python evaluate.py

Data iterators and utilities are in model/data_utils.py, and the model, together with its training and test procedures, is in model/ner_model.py.
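The first step above (building a vocabulary and trimming the word2vec vectors to it) can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in build_data.py: the helper names build_vocab and trim_vectors and the toy vectors are hypothetical, and the real script may also lowercase words, build character and tag vocabularies, or store the result differently.

```python
def build_vocab(sentences):
    """Collect the set of distinct words from tagged sentences.

    Each sentence is a list of (word, tag) pairs.
    """
    return {word for sentence in sentences for word, _tag in sentence}


def trim_vectors(vocab, vectors):
    """Keep only the pretrained vectors whose word occurs in the vocabulary."""
    return {word: vec for word, vec in vectors.items() if word in vocab}


# Toy data, for illustration only
sentences = [
    [("John", "B-PER"), ("lives", "O"), ("in", "O"),
     ("New", "B-LOC"), ("York", "I-LOC"), (".", "O")],
]
pretrained = {
    "John": [0.1, 0.2],
    "York": [0.3, 0.4],
    "Paris": [0.5, 0.6],   # not in the data, so it is dropped
}

vocab = build_vocab(sentences)
trimmed = trim_vectors(vocab, pretrained)
print(sorted(trimmed))
```

Trimming keeps the embedding matrix small: only words that actually appear in the dataset need a row, so the full pretrained file does not have to be loaded at training time.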

Training Data

A default test file is provided to help you get started.

The training data must be in the following format (identical to the CoNLL2003 dataset), with one token and its tag per line and a blank line between sentences:

John B-PER
lives O
in O
New B-LOC
York I-LOC
. O

This O
is O
another O
sentence O
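A minimal sketch of a reader for this format is shown below. The function name read_conll is hypothetical, and the project's own iterator in model/data_utils.py may behave differently. The sketch assumes that each non-blank line holds a token first and its tag last, separated by whitespace, and that blank lines separate sentences.

```python
def read_conll(lines):
    """Parse token/tag lines into a list of sentences.

    Each sentence is a list of (token, tag) pairs.
    """
    sentences, current = [], []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        if len(parts) < 2:
            raise ValueError(f"line {line_no}: expected 'token tag', got {line!r}")
        # First column is the token, last column is the tag
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


sample = """John B-PER
lives O
in O
New B-LOC
York I-LOC
. O

This O
is O
another O
sentence O
"""

sentences = read_conll(sample.splitlines())
print(len(sentences), sentences[0][:2])
```

Raising an error on a line without a tag makes malformed data files fail early instead of silently producing misaligned tokens and labels.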

Once you have produced your data files, update the dataset parameters in model/config.py, for example:

# dataset
filename_dev = "data/tr.testa.iobes"
filename_test = "data/tr.testb.iobes"
filename_train = "data/tr.train.iobes"
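The .iobes extension of these files suggests that the data uses the IOBES tagging scheme, although the README does not state this explicitly. If your data uses B-/I-/O tags like the format example above, a conversion along the following lines may help. It is a sketch under that assumption, and the function name iob_to_iobes is hypothetical rather than part of this project.

```python
def iob_to_iobes(tags):
    """Convert one sentence's IOB2 tags (B-, I-, O) to IOBES (adds S- and E-)."""
    iobes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            iobes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label
        continues = next_tag == "I-" + label
        if prefix == "B":
            # A single-token entity becomes S-, otherwise B- is kept
            iobes.append(tag if continues else "S-" + label)
        elif prefix == "I":
            # The last token of a multi-token entity becomes E-
            iobes.append(tag if continues else "E-" + label)
        else:
            raise ValueError(f"unexpected tag: {tag!r}")
    return iobes


converted = iob_to_iobes(["B-PER", "O", "O", "B-LOC", "I-LOC", "O"])
print(converted)
```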

License

This project is licensed under the terms of the Apache 2.0 license (like TensorFlow and its derivatives). If you use it for research, a citation would be appreciated.