MEL: Mannheim Entity Linking

MEL is a Python library that aims to provide an efficient, easy-to-use, end-to-end Entity Linking system. Entity Linking is the task of linking mentions in free text to entities in a Knowledge Base (in our case, Wikipedia). For example, "Washington" can refer to https://en.wikipedia.org/wiki/George_Washington or https://en.wikipedia.org/wiki/Washington,_D.C. or even https://en.wikipedia.org/wiki/Federal_government_of_the_United_States.

MEL consists of three main components:

Mention detection using spacy.
Candidate generation based on nel.
Entity linking using an implementation of the approach described in Yamada et al.

By leveraging the best methods for each component, MEL is able to achieve close to state-of-the-art performance. An easy-to-set-up Flask server is also included.
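To make the three stages concrete, here is a minimal toy sketch of how mention detection, candidate generation and linking fit together. It is not MEL's code or API: the dictionaries, prior values, context words and function names are all hypothetical, and the scoring (prior plus context-word overlap) is a simple stand-in for the learned model.

```python
# Toy illustration of a three-stage entity linking pipeline.
# Nothing here is MEL's actual API; all names and data are hypothetical.

# Stage 2 resource: mention string -> candidate entities with made-up priors.
CANDIDATES = {
    "washington": {
        "George_Washington": 0.3,
        "Washington,_D.C.": 0.5,
        "Federal_government_of_the_United_States": 0.2,
    },
}

# Stage 3 resource: words assumed to appear near each entity.
CONTEXT_WORDS = {
    "George_Washington": {"president", "general", "first"},
    "Washington,_D.C.": {"city", "capital", "visited"},
    "Federal_government_of_the_United_States": {"sanctions", "announced", "policy"},
}


def detect_mentions(tokens):
    """Stage 1: return indices of tokens that are known mention strings."""
    return [i for i, tok in enumerate(tokens) if tok.lower() in CANDIDATES]


def generate_candidates(mention):
    """Stage 2: look up candidate entities and their priors."""
    return CANDIDATES.get(mention.lower(), {})


def link(tokens, index):
    """Stage 3: pick the candidate with the best prior + context overlap score."""
    context = {t.lower() for j, t in enumerate(tokens) if j != index}
    candidates = generate_candidates(tokens[index])
    scores = {
        entity: prior + len(context & CONTEXT_WORDS[entity])
        for entity, prior in candidates.items()
    }
    return max(scores, key=scores.get)


def link_text(text):
    """Run all three stages on a sentence and return (mention, entity) pairs."""
    tokens = text.replace(".", " ").split()
    return [(tokens[i], link(tokens, i)) for i in detect_mentions(tokens)]


print(link_text("Washington announced new sanctions."))
print(link_text("We visited Washington"))
```

The same surface form resolves to different entities depending on context, which is the behaviour the linking stage is responsible for.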

Dependencies

Python 3 with Numpy
PyTorch
Spacy
Flask

Setup

Clone this repo.
We recommend creating a virtual environment for this project using conda or pipenv.
Install dependencies by running pip install -r requirements.txt.
Install the spacy model with python -m spacy download en.
To use MEL, you need several dicts that are stored as memory-mapped files. These are hosted [here](mmap file link). We also provide a pre-trained model [here](conll model file link), trained on [CONLL](conll data link here). A shell script downloads these files and sets up the project's data structure:

chmod +x bin/setup.sh
bin/setup.sh

Note: This will download ~4 GB of data.

Performance

We compare against the popular TagMe system and report F1 scores on the combined mention detection and entity linking task. For mention detection, any predicted mention with over 80% overlap with a gold mention is considered a match. TagMe can filter its entity links using a threshold parameter; for a fair comparison, we show results for three different threshold values. Each cell shows overall F1 score / linking accuracy.

Data Set	MEL	TagMe (Threshold 0.1)	TagMe (Threshold 0.3)	TagMe (Threshold 0.5)
Conll-Dev	0.70 / 0.88	0.39 / 0.70	0.52 / 0.77	0.33 / 0.86
MSNBC	0.67 / 0.88	0.28 / 0.80	0.46 / 0.87	0.23 / 0.90
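The matching rule and the two reported numbers can be sketched as follows. This is not the evaluation script behind the table. It assumes spans are (start, end) character offsets with an exclusive end, that overlap is measured relative to the longer of the two spans, that a prediction counts towards F1 only when both the mention matches and the entity is correct, and that linking accuracy is computed over matched mentions only. The README does not specify any of these details.

```python
def overlap_ratio(pred, gold):
    """Overlap of two (start, end) spans, end exclusive, relative to the longer span."""
    intersection = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    longest = max(pred[1] - pred[0], gold[1] - gold[0])
    return intersection / longest if longest else 0.0


def evaluate(predicted, gold, threshold=0.8):
    """predicted / gold: lists of ((start, end), entity). Returns (f1, linking_accuracy)."""
    unmatched_gold = list(gold)
    detected = 0  # predicted mentions that match a gold mention
    correct = 0   # matched mentions whose entity is also right
    for span, entity in predicted:
        for item in unmatched_gold:
            gold_span, gold_entity = item
            if overlap_ratio(span, gold_span) > threshold:
                detected += 1
                correct += entity == gold_entity
                # Each gold mention can be matched at most once.
                unmatched_gold.remove(item)
                break
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / detected if detected else 0.0
    return f1, accuracy


gold = [((0, 10), "A"), ((20, 30), "B"), ((40, 50), "C")]
predicted = [((0, 10), "A"), ((20, 29), "X"), ((60, 70), "D")]
print(evaluate(predicted, gold))
```

Under these assumptions, a system that detects few mentions but links them well gets a low F1 and a high linking accuracy, which is the pattern in the TagMe columns at higher thresholds.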

Train

If you want to train a new model, you need to generate training, dev and test data in the format used by MEL. To accomplish this, we provide a script: gen_train_data.py. Its input is data in the AIDA CoNLL-YAGO dataset format, which we cannot distribute due to licensing issues. Each document must start like so:

-DOCSTART- ([DOC ID])

The doc id can be anything you choose, but for dev documents it should include 'testa' and for test documents it should include 'testb'. Each line thereafter should have five fields separated by tabs: token, BIO tag, full name of the mention, title of the entity it refers to, and Wikipedia page URL. For example:

German	B	German	Germany	http://en.wikipedia.org/wiki/Germany
European	B	European Commission	European_Commission	http://en.wikipedia.org/wiki/European_Commission
Commission	I	European Commission	European_Commission	http://en.wikipedia.org/wiki/European_Commission

Documents should be separated by a blank line. A default config file is provided and can be used to train a new model on CPU like so:

python train.py --my-config configs/default.yaml --use_cuda False --data_path data
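Before running gen_train_data.py, it can help to check that your input follows the format described above. The reader below is a minimal sketch, not part of MEL: the function names are hypothetical, and it assumes that tokens outside a mention may carry fewer than five fields, as in the original AIDA files.

```python
FIELDS = ("token", "bio", "mention", "title", "url")


def read_documents(lines):
    """Parse lines in the described format into {doc_id: [row dicts]}."""
    documents = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("-DOCSTART-"):
            # Header looks like "-DOCSTART- (doc id)"; keep the text inside the parentheses.
            doc_id = line[len("-DOCSTART-"):].strip().strip("()")
            current = documents.setdefault(doc_id, [])
        elif line.strip() and current is not None:
            # Assumption: rows with fewer than five fields are tokens outside a mention.
            current.append(dict(zip(FIELDS, line.split("\t"))))
    return documents


def split_name(doc_id):
    """Map a doc id to its split using the 'testa' / 'testb' convention."""
    if "testa" in doc_id:
        return "dev"
    if "testb" in doc_id:
        return "test"
    return "train"


sample = [
    "-DOCSTART- (1 testa EU)\n",
    "German\tB\tGerman\tGermany\thttp://en.wikipedia.org/wiki/Germany\n",
    "\n",
    "-DOCSTART- (2 news)\n",
    "European\tB\tEuropean Commission\tEuropean_Commission\thttp://en.wikipedia.org/wiki/European_Commission\n",
]
docs = read_documents(sample)
for doc_id, rows in docs.items():
    print(split_name(doc_id), doc_id, [row["token"] for row in rows])
```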

Flask server

Setting up a server is as easy as running:

python app.py --data_path data --model conll_v0.1.pt

An example notebook on how to use the API is here.

Speed

MEL is efficient because it spends most of its compute time running either spacy's Cython code or PyTorch's C code. Here we compare the throughput of MEL's server with that of TagMe's API.

API	words / second
MEL	1,248
TagMe	1,095
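A throughput figure of this kind could be measured roughly as sketched below. This is only an illustration: the README does not say how the numbers above were obtained, and `annotate` is a hypothetical stand-in for whatever function sends one document to the server being measured.

```python
import time


def words_per_second(annotate, documents):
    """Time `annotate` over `documents` and return whitespace-delimited words per second."""
    total_words = sum(len(doc.split()) for doc in documents)
    start = time.perf_counter()
    for doc in documents:
        annotate(doc)
    elapsed = time.perf_counter() - start
    return total_words / elapsed if elapsed > 0 else float("inf")


def fake_annotate(text):
    """Placeholder for a real API call; it only sleeps briefly."""
    time.sleep(0.001)


rate = words_per_second(fake_annotate, ["Washington announced new sanctions."] * 20)
print(f"{rate:.0f} words / second")
```

When comparing a local server against a remote API, network latency is part of what gets measured, so both systems should be timed from the same client.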

Contact

Rohit Gupta - rohitg1594@gmail.com

Samuel Broscheit - samuel.broscheit@gmail.com

References

Learning Distributed Representations of Texts and Entities from Knowledge Base, Yamada et al.
Entity Disambiguation with Web Links, Chisholm et al.
