Code for "Let's Stop Incorrect Comparisons in End-to-end Relation Extraction!", accepted at EMNLP 2020.
Because of differences in evaluation settings, the end-to-end Relation Extraction literature contains incorrect comparisons and conclusions.
The goal of this code is to provide a clear setup to evaluate models using either the "Strict" or "Boundaries" setting as defined in (Bekoulis 2018), and to quantify their difference. In the Strict setting, a predicted relation is correct only if its type and both argument entities (boundaries and entity types) are correct; in the Boundaries setting, the argument entity types are not checked.
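As a rough illustration (with hypothetical data structures, not the repository's evaluation code), the difference boils down to whether argument entity types are checked:

def relation_correct(pred, gold, setting="strict"):
    """pred / gold are dicts: {"type": str, "head": (start, end, ent_type), "tail": (start, end, ent_type)}."""
    if pred["type"] != gold["type"]:
        return False
    for arg in ("head", "tail"):
        p_start, p_end, p_type = pred[arg]
        g_start, g_end, g_type = gold[arg]
        if (p_start, p_end) != (g_start, g_end):
            return False  # argument boundaries must always match
        if setting == "strict" and p_type != g_type:
            return False  # entity types are only checked in the Strict setting
    return True

gold = {"type": "Work_For", "head": (0, 1, "Peop"), "tail": (4, 4, "Org")}
pred = {"type": "Work_For", "head": (0, 1, "Org"), "tail": (4, 4, "Org")}
# Correct under "boundaries" but not under "strict" (the head entity type is wrong).
print(relation_correct(pred, gold, "strict"), relation_correct(pred, gold, "boundaries"))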
As an example, we perform a previously unexplored ablation study of two recent developments:
- the introduction of pretrained Language Models (such as BERT or ELMo)
- modeling the NER task as a classification for every span in the sentence instead of IOBES sequence labeling
We evaluate their impact on CoNLL04 and ACE05, with no overlapping entities.
The code is written in Python 3.6 with the following main dependencies:
- PyTorch 1.3.1
- numpy 1.18.1
- transformers 2.4.1
- tqdm
- (optional) tensorboard 2.2.1
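Assuming a pip-based environment, the pinned versions above can be installed with (adjust the torch build to your CUDA setup):
pip install torch==1.3.1 numpy==1.18.1 transformers==2.4.1 tqdm tensorboard==2.2.1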
To download the GloVe embeddings used for the non-contextualized setup, run:
glove_path='http://nlp.stanford.edu/data/glove.840B.300d.zip'
mkdir -p embeddings/glove.840B
curl -LO $glove_path
unzip glove.840B.300d.zip -d embeddings/glove.840B/
rm glove.840B.300d.zip
We provide the CoNLL04 dataset in the data/ folder, as formatted and used by (Eberts 2020) (code). It corresponds to the split released by (Gupta 2016) (code).
Due to licensing issues, we do not provide the ACE05 dataset.
The instructions and scripts to set up the dataset from (Miwa and Bansal 2016) (code) are in the ace_preprocessing/ folder.
Although more configurations can be tested with this code, we focused on two ablations:
- The use of a pretrained language model over non-contextualized representations with a BiLSTM:
  - BERT Encoder: embedder is bert-base and no encoder flag
  - (GloVe + CharBiLSTM) Embedding + BiLSTM Encoder: embedder is word char and encoder is bilstm
- The use of a Span-level NER module over an IOBES sequence tagging model (see the sketch after this list):
  - ner_decoder is iobes or span
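To make the second ablation concrete, here is a minimal sketch of the two NER decoding targets (an illustrative example with assumed names and label sets, not the repository's modules): IOBES assigns one tag per token, whereas the span-based formulation assigns a label to every candidate span up to a maximum width.

tokens = ["John", "Smith", "works", "in", "Paris"]

# 1) IOBES sequence labeling: one label per token.
iobes_labels = ["B-PER", "E-PER", "O", "O", "S-LOC"]

# 2) Span-level classification: every span up to max_width gets a label,
#    "O" (no entity) for all but the gold spans.
max_width = 3
span_labels = {}
for start in range(len(tokens)):
    for end in range(start, min(start + max_width, len(tokens))):
        span_labels[(start, end)] = "O"
span_labels[(0, 1)] = "PER"   # "John Smith"
span_labels[(4, 4)] = "LOC"   # "Paris"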
To reproduce our setup, run the following commands in the code/ folder, where $dataset is either conll04 or ace05 and $seed is an integer seed (we used 0 to 4):
python train.py -ds $dataset -emb word char -enc bilstm -ner_dec iobes -d 0.1 -bs 8 -lr 5e-4 -s $seed
python train.py -ds $dataset -emb word char -enc bilstm -ner_dec span -d 0.1 -bs 8 -lr 5e-4 -s $seed
python train.py -ds $dataset -emb bert-base -ner_dec iobes -d 0.1 -bs 8 -lr 1e-5 -s $seed
python train.py -ds $dataset -emb bert-base -ner_dec span -d 0.1 -bs 8 -lr 1e-5 -s $seed
To train on the combination of the train and dev sets, add the -m train+dev flag after a first standard training with the same parameters.
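For instance, for the BERT + span configuration above, the second run would look like:
python train.py -ds $dataset -emb bert-base -ner_dec span -d 0.1 -bs 8 -lr 1e-5 -s $seed -m train+dev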
Note: We used seeds 0 to 4 for all our experiments. However, despite careful manual seeding, results are not exactly reproducible across different GPU hardware.
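For context, "manual seeding" refers to seeding all RNGs involved, along the lines of the following generic sketch (not necessarily the exact calls used in this repository); even then, some CUDA kernels remain non-deterministic across GPU models.

import random
import numpy as np
import torch

def set_seed(seed):
    # Seed Python, NumPy and PyTorch (CPU and all visible GPUs).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)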
If you find any of this work useful, please cite our paper as follows:
@inproceedings{taille-etal-2020-lets,
title = "Let{'}s Stop Incorrect Comparisons in End-to-end Relation Extraction!",
author = "Taill{\'e}, Bruno and
Guigue, Vincent and
Scoutheeten, Geoffrey and
Gallinari, Patrick",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-main.301",
pages = "3689--3701",
}