Code for "Let's Stop Incorrect Comparisons in End-to-end Relation Extraction!", accepted at EMNLP 2020.
Because of differences in evaluation settings, the end-to-end Relation Extraction literature contains incorrect comparisons and conclusions.
The goal of this code is to provide a clear setup to evaluate models using either the "Strict" or "Boundaries" setting as defined in (Bekoulis 2018), and to quantify their difference. In the Strict setting, a predicted relation is correct only if its type and both argument entities (boundaries and entity types) are correct; in the Boundaries setting, the argument entity types are not checked.
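As a rough illustration (with hypothetical data structures, not the repository's evaluation code), the difference boils down to whether argument entity types are checked:

def relation_correct(pred, gold, setting="strict"):
    """pred / gold are dicts: {"type": str, "head": (start, end, ent_type), "tail": (start, end, ent_type)}."""
    if pred["type"] != gold["type"]:
        return False
    for arg in ("head", "tail"):
        p_start, p_end, p_type = pred[arg]
        g_start, g_end, g_type = gold[arg]
        if (p_start, p_end) != (g_start, g_end):
            return False  # argument boundaries must always match
        if setting == "strict" and p_type != g_type:
            return False  # entity types are only checked in the Strict setting
    return True

gold = {"type": "Work_For", "head": (0, 1, "Peop"), "tail": (4, 4, "Org")}
pred = {"type": "Work_For", "head": (0, 1, "Org"), "tail": (4, 4, "Org")}
# Correct under "boundaries" but not under "strict" (the head entity type is wrong).
print(relation_correct(pred, gold, "strict"), relation_correct(pred, gold, "boundaries"))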
As an example, we perform a previously unexplored ablation study of two recent developments:
- the introduction of pretrained Language Models (such as BERT or ELMo)
- modeling the NER task as a classification for every span in the sentence instead of IOBES sequence labeling
We evaluate their impact on CoNLL04 and ACE05, with no overlapping entities.
The code is written in Python 3.6 with the following main dependencies:
- PyTorch 1.3.1
- numpy 1.18.1
- transformers 2.4.1
- tqdm
- (optional) tensorboard 2.2.1
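Assuming a pip-based environment, the pinned versions above can be installed with (adjust the torch build to your CUDA setup):
pip install torch==1.3.1 numpy==1.18.1 transformers==2.4.1 tqdm tensorboard==2.2.1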
To download the GloVe embeddings used for the non-contextualized setup, run:
glove_path='http://nlp.stanford.edu/data/glove.840B.300d.zip'
mkdir -p embeddings/glove.840B
curl -LO $glove_path
unzip glove.840B.300d.zip -d embeddings/glove.840B/
rm glove.840B.300d.zip
We provide the CoNLL04 dataset in the data/ folder, as formatted and used by (Eberts 2020) (code). It corresponds to the split released by (Gupta 2016) (code).
Due to licensing issues, we do not provide the ACE05 dataset.
The instructions and scripts to set up the dataset from (Miwa and Bansal 2016) (code) are in the ace_preprocessing/ folder.
Although more configurations can be tested with this code, we focused on two ablations:
- The use of a pretrained language model over non-contextualized representations with a BiLSTM:
  - BERT Encoder: embedder is bert-base and no encoder flag
  - (GloVe + CharBiLSTM) Embedding + BiLSTM Encoder: embedder is word char and encoder is bilstm
- The use of a Span-level NER module over an IOBES sequence tagging model (see the sketch after this list):
  - ner_decoder is iobes or span
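To make the second ablation concrete, here is a minimal sketch of the two NER decoding targets (an illustrative example with assumed names and label sets, not the repository's modules): IOBES assigns one tag per token, whereas the span-based formulation assigns a label to every candidate span up to a maximum width.

tokens = ["John", "Smith", "works", "in", "Paris"]

# 1) IOBES sequence labeling: one label per token.
iobes_labels = ["B-PER", "E-PER", "O", "O", "S-LOC"]

# 2) Span-level classification: every span up to max_width gets a label,
#    "O" (no entity) for all but the gold spans.
max_width = 3
span_labels = {}
for start in range(len(tokens)):
    for end in range(start, min(start + max_width, len(tokens))):
        span_labels[(start, end)] = "O"
span_labels[(0, 1)] = "PER"   # "John Smith"
span_labels[(4, 4)] = "LOC"   # "Paris"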
To reproduce our setup, run the following commands in the code/ folder, where $dataset is either conll04 or ace05 and $seed is an integer seed (we used 0 to 4):
python train.py -ds $dataset -emb word char -enc bilstm -ner_dec iobes -d 0.1 -bs 8 -lr 5e-4 -s $seed
python train.py -ds $dataset -emb word char -enc bilstm -ner_dec span -d 0.1 -bs 8 -lr 5e-4 -s $seed
python train.py -ds $dataset -emb bert-base -ner_dec iobes -d 0.1 -bs 8 -lr 1e-5 -s $seed
python train.py -ds $dataset -emb bert-base -ner_dec span -d 0.1 -bs 8 -lr 1e-5 -s $seed
To train on the combination of the train and dev sets, add the -m train+dev flag after a first standard training with the same parameters.
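For instance, for the BERT + span configuration above, the second run would look like:
python train.py -ds $dataset -emb bert-base -ner_dec span -d 0.1 -bs 8 -lr 1e-5 -s $seed -m train+dev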
Note: We used seeds 0 to 4 for all our experiments. However, despite careful manual seeding, results are not exactly reproducible across different GPU hardware.
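For context, "manual seeding" refers to seeding all RNGs involved, along the lines of the following generic sketch (not necessarily the exact calls used in this repository); even then, some CUDA kernels remain non-deterministic across GPU models.

import random
import numpy as np
import torch

def set_seed(seed):
    # Seed Python, NumPy and PyTorch (CPU and all visible GPUs).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)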
If you find any of this work useful, please cite our paper as follows:
@inproceedings{taille-etal-2020-lets,
title = "Let{'}s Stop Incorrect Comparisons in End-to-end Relation Extraction!",
author = "Taill{\'e}, Bruno and
Guigue, Vincent and
Scoutheeten, Geoffrey and
Gallinari, Patrick",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-main.301",
pages = "3689--3701",
}