blstm-crf-ner

A named entity recognition (NER) model (B-LSTM + CRF + word embeddings) implemented in TensorFlow. It tags noisy Turkish text (tweets in particular) without using any hand-crafted features or rules.

The model is very similar to those of Lample et al., Gungor et al., and Ma and Hovy. Accordingly, the source code is heavily influenced by Guillaume Genthial's sequence_tagging and Guillaume Lample's tagger projects.

Prerequisites

Python (3 or newer)
pip, virtualenv, make

Getting started

Create an isolated environment with:

virtualenv -p /usr/bin/python3 virtual-env
source virtual-env/bin/activate
pip install -r requirements.txt

Hint: When you are done working, type deactivate to exit the virtual environment.

Download the word2vec vectors with

make word2vec

Alternatively, you can download them manually here and update the filename_word2vec entry in model/config.py. You can also choose not to load pretrained word vectors by setting the use_pretrained entry to False in the same file.
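
As a rough illustration, the two entries mentioned above might be edited as in the fragment below. The plain-assignment layout mirrors the dataset entries shown in the Training Data section and is an assumption about how the config file is organized; the vector path is a placeholder, not a file shipped with the project.

```python
# Config fragment (layout assumed; check the actual file before editing).

# Option 1: point the config at manually downloaded vectors.
# "data/my_vectors.txt" is a hypothetical placeholder path.
filename_word2vec = "data/my_vectors.txt"
use_pretrained = True

# Option 2: skip pretrained vectors and learn the embeddings during training.
# use_pretrained = False
```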

Build the training data, train and evaluate the model with

make run

Details

Here is the breakdown of the commands executed in make run:

Build vocab from the data and extract trimmed word2vec vectors according to the config in model/config.py.

python build_data.py
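
To make the idea of "trimmed" vectors concrete, here is a minimal sketch of the step: collect the vocabulary of the dataset, then keep only the pretrained vectors for words that actually occur in it. This is not the code in build_data.py. The helper names build_vocab and trim_vectors are hypothetical, and the sketch assumes the vectors are stored as plain text, one "word v1 v2 ..." entry per line with no header line.

```python
import io


def build_vocab(sentences):
    """Collect the distinct words over all sentences of (word, tag) pairs."""
    return {word for sentence in sentences for word, _tag in sentence}


def trim_vectors(vector_lines, vocab):
    """Keep only the vectors whose word occurs in vocab.

    vector_lines is any iterable of text lines shaped like "word v1 v2 ...".
    """
    trimmed = {}
    for line in vector_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        if word in vocab:
            trimmed[word] = [float(v) for v in values]
    return trimmed


# Toy data standing in for the dataset and the pretrained vector file.
sentences = [
    [("John", "B-PER"), ("lives", "O")],
    [("New", "B-LOC"), ("York", "I-LOC")],
]
vocab = build_vocab(sentences)

vectors_txt = io.StringIO("John 0.1 0.2\nParis 0.3 0.4\nYork 0.5 0.6\n")
trimmed = trim_vectors(vectors_txt, vocab)
print(sorted(trimmed))  # ['John', 'York']
```

Trimming keeps the embedding matrix small: words such as "Paris" that never appear in the data are dropped, so only the vectors the model can actually use are loaded at training time.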

Train the model with

python train.py
# Or redirect everything into a log file and detach the process by typing:
# python train.py >> out.log 2>&1 & disown

Evaluate and interact with the model with

python evaluate.py

Data iterators and utilities are in model/data_utils.py; the model, together with its training and test procedures, is in model/ner_model.py.

Training Data

A default test file is provided to help you get started.

The training data must be in the following format (identical to the CoNLL2003 dataset): one word and its tag per line, with a blank line between sentences.

John B-PER
lives O
in O
New B-LOC
York I-LOC
. O

This O
is O
another O
sentence O
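
A file in this format can be read with a few lines of Python. The sketch below is only an illustration of the format, not the iterator in model/data_utils.py; the function name read_conll is hypothetical, and it assumes each non-blank line holds exactly one word and one tag separated by whitespace.

```python
def read_conll(lines):
    """Yield sentences as lists of (word, tag) pairs.

    Expects one "word tag" pair per line; a blank line ends a sentence.
    """
    sentence = []
    for line in lines:
        parts = line.split()
        if not parts:
            # Blank line: close the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        if len(parts) != 2:
            raise ValueError(f"expected 'word tag', got: {line!r}")
        word, tag = parts
        sentence.append((word, tag))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


sample = """John B-PER
lives O
in O
New B-LOC
York I-LOC
. O

This O
is O
"""
sentences = list(read_conll(sample.splitlines()))
print(len(sentences))   # 2
print(sentences[0][0])  # ('John', 'B-PER')
```

Raising an error on lines that lack a tag makes malformed files fail early instead of silently producing misaligned word and tag sequences.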

Once you have produced your data files, update the corresponding parameters in model/config.py, for example:

# dataset
filename_dev = "data/tr.testa.iobes"
filename_test = "data/tr.testb.iobes"
filename_train = "data/tr.train.iobes"

License

This project is licensed under the terms of the Apache 2.0 license (like TensorFlow and its derivatives). If you use it for research, a citation would be appreciated.

emrekgn/blstm-crf-ner