The implementation of the paper "A Multi-task Approach for Named Entity Recognition on Social Media Data," which won the WNUT-2017 Shared Task.

A Multi-task Approach for Named Entity Recognition on Social Media Data

This repository contains the implementation of the system described in the paper A Multi-task Approach for Named Entity Recognition on Social Media Data. The system achieved first place in the shared task of the 3rd Workshop on Noisy User-generated Text (W-NUT) at the EMNLP 2017 conference.

System Overview

The system uses a multi-task neural network as a feature extractor. The network combines a bidirectional LSTM (B-LSTM), a CNN, and a dense representation. Once the network is trained, we transfer its learned representation to a Conditional Random Fields (CRF) classifier, which makes the final prediction. More information can be found in the paper.
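The hand-off from the network to the CRF can be pictured with a small example. This is a minimal sketch, not the repository's code: it assumes the trained network yields one activation vector per token, and it turns each vector into a per-token feature dictionary of the kind that CRF toolkits such as python-crfsuite accept. The function name `activations_to_crf_features`, the `nn_` feature prefix, and the bias feature are hypothetical.

```python
from typing import Dict, List, Sequence


def activations_to_crf_features(
    sentence_activations: Sequence[Sequence[float]],
    prefix: str = "nn_",  # hypothetical feature-name prefix
) -> List[Dict[str, float]]:
    """Turn per-token activation vectors into per-token feature dicts.

    Each token gets one real-valued feature per network dimension,
    plus a constant bias feature.
    """
    features = []
    for vector in sentence_activations:
        token_features = {f"{prefix}{i}": float(v) for i, v in enumerate(vector)}
        token_features["bias"] = 1.0
        features.append(token_features)
    return features


# Made-up activations for a three-token sentence, two dimensions per token.
example = [[0.1, -0.4], [0.9, 0.2], [0.0, 0.7]]
crf_input = activations_to_crf_features(example)
```

Each sentence becomes a list of dictionaries, one per token, which can then be paired with the gold label sequence when training the CRF.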

Dependencies

  • Keras
  • Theano
  • CRF Suite



* The word2vec reader code and the binary model file are located under the embeddings/twitter/ directory. The binary file (word2vec_twitter_model.bin) has to be downloaded and added manually.

Repository Structure

|__ common/
        |__ representation.py --> functions to encode data (postags, gazetteers, etc.)
        |__ utilities.py --> general functions (read/write files, metrics, label vectorization, etc.)
|__ data/
        |__ emerging.dev.conll --> validation data in CoNLL format
        |__ emerging.dev.conll.preprocess.url --> validation data with URLs replaced by the <URL> tag
        |__ emerging.dev.conll.preprocess.url.postag --> POS tags for the validation data
        |__ emerging.test.conll
        |__ emerging.test.conll.preprocess.url
        |__ emerging.test.conll.preprocess.url.postag
        |__ emerging.train.conll
        |__ emerging.train.conll.preprocess.url
        |__ emerging.train.conll.preprocess.url.postag
        |__ README.md --> Organizer's description of the data
|__ embeddings/
        |__ gazetteers/
                |__ one.token.check.emb --> representation of gazetteers at the word level 
        |__ twitter/
                |__ word2vec_twitter_model.bin --> Frederic Godin's word2vec model 
                |__ word2vecReader.py 
                |__ word2vecReaderUtils.py
|__ models/
        |__ crfsuite.py --> functions to train a CRF based on a NN model
        |__ network.py --> functions to build, train, and predict with a NN
|__ predictions/
        |__ eval.sh --> a wrapper of the official evaluation script
        |__ wnuteval.py --> official script for evaluation
        |__ submission2017 --> official submission of our system
|__ settings.py --> global variables such as paths and default tags 
|__ main.py --> the entire pipeline of our system
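
The `.preprocess.url` files above hold the same data with URLs replaced by the `<URL>` tag. The sketch below illustrates that kind of preprocessing. It is not the script used to produce the files: it assumes tab-separated token/label lines with blank lines between sentences, and the regular expression and the function names `read_conll` and `replace_urls` are hypothetical.

```python
import re
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, label) pairs

# Assumed URL shape: tokens starting with http://, https://, or www.
URL_PATTERN = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)


def read_conll(text: str) -> List[Sentence]:
    """Parse tab-separated token/label lines; blank lines end a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def replace_urls(sentences: List[Sentence]) -> List[Sentence]:
    """Replace every URL-like token with the <URL> tag, keeping its label."""
    return [
        [("<URL>" if URL_PATTERN.match(token) else token, label)
         for token, label in sentence]
        for sentence in sentences
    ]


sample = "Check\tO\nhttp://t.co/abc\tO\n\nHello\tO\nParis\tB-location\n"
processed = replace_urls(read_conll(sample))
```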


Running the evaluation script:

$ cd predictions 
$ ./eval.sh <prediction_file>
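
For a rough picture of what an entity-level score measures, the sketch below extracts entity spans from BIO label sequences and computes F1 over exact span matches for a single sentence. It is an illustration under stated assumptions, not a reimplementation of wnuteval.py, and its numbers should not be expected to match the official script. The names `bio_to_spans` and `entity_f1` are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(labels: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO sequence.

    A span starts at a B- tag and extends over following I- tags of the
    same type. I- tags that do not continue an open span are ignored.
    """
    spans: Set[Span] = set()
    start, etype = None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        inside = label.startswith("I-") and etype == label[2:]
        if start is not None and not inside:
            spans.add((start, i, etype))
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """F1 over exact (boundaries and type) span matches for one sentence."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-person", "I-person", "O", "B-location"]
pred = ["B-person", "I-person", "O", "O"]
score = entity_f1(gold, pred)
```

In this example the prediction recovers one of the two gold entities with no false positives, so precision is 1.0 and recall is 0.5.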




If you find the repository useful for your projects, please cite our paper as follows:

