The GENRE (Generative ENtity REtrieval) system, as presented in Autoregressive Entity Retrieval, implemented in pytorch.
@article{de2020autoregressive,
title={Autoregressive Entity Retrieval},
author={De Cao, Nicola and Izacard, Gautier and Riedel, Sebastian and Petroni, Fabio},
journal={arXiv preprint arXiv:2010.00904},
year={2020}
}
Please consider citing our work if you use code from this repository.
In a nutshell, GENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on a fine-tuned BART architecture. GENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search to generate only valid identifiers. For Wikipedia page retrieval in open-domain question answering, for example, the model generates the title of the relevant Wikipedia page conditioned on the question.
For end-to-end entity linking, GENRE re-generates the input text annotated with a markup that identifies mention spans and their entity names.
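As an illustration of the markup (the sentence and its annotation below are only an example, not output from the released models), a mention is wrapped in curly brackets and its predicted entity name follows in square brackets:

```
In 1921, { Einstein } [ Albert Einstein ] received the Nobel Prize in Physics.
```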
GENRE achieves state-of-the-art results on multiple datasets.
- python>=3.7
- pytorch>=1.6
- fairseq>=0.10 (for training -- optional for inference)
- transformers>=4.0 (optional for inference)
See the examples on how to use GENRE with both pytorch/fairseq and huggingface/transformers:
Generally, after importing and loading the model, you would generate predictions (in this example for Entity Disambiguation) with a simple call like:
model.sample(
sentences=[
"[START_ENT] Armstrong [END_ENT] was the first man on the Moon."
]
)
[[{'text': 'Neil Armstrong', 'logprob': tensor(-0.1443)},
{'text': 'William Armstrong', 'logprob': tensor(-1.4650)},
{'text': 'Scott Armstrong', 'logprob': tensor(-1.7311)},
{'text': 'Arthur Armstrong', 'logprob': tensor(-1.7356)},
{'text': 'Rob Armstrong', 'logprob': tensor(-1.7426)}]]
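For completeness, here is a minimal sketch of the loading step that precedes the call above, using the fairseq wrapper from this repository (the checkpoint path is illustrative and assumes a downloaded and uncompressed Entity Disambiguation model):

```python
# minimal sketch: load an Entity Disambiguation checkpoint with the
# fairseq wrapper shipped in this repository and run generation
from genre.fairseq_model import GENRE

# illustrative path to a downloaded and uncompressed checkpoint
model = GENRE.from_pretrained("models/fairseq_entity_disambiguation_aidayago").eval()

predictions = model.sample(
    sentences=["[START_ENT] Armstrong [END_ENT] was the first man on the Moon."]
)
print(predictions)  # top beams with their log-probabilities, as shown above
```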
NOTE: we used fairseq for all experiments in the paper. The huggingface/transformers models were obtained with a conversion script similar to this; therefore, results might differ slightly.
Use the link above to download the models in .tar.gz
format and then run tar -zxvf <FILENAME>
to uncompress them. Alternatively, use this script to download all of them.
| Training Dataset | pytorch / fairseq | huggingface / transformers |
| --- | --- | --- |
| WIKIPEDIA | fairseq_e2e_entity_linking_wiki_abs | hf_e2e_entity_linking_wiki_abs |
| WIKIPEDIA + AidaYago2 | fairseq_e2e_entity_linking_aidayago | hf_e2e_entity_linking_aidayago |
| Training Dataset | pytorch / fairseq | huggingface / transformers |
| --- | --- | --- |
| KILT | fairseq_wikipage_retrieval | hf_wikipage_retrieval |
See the examples here on how to load the models and run inference.
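As a sketch of the huggingface/transformers side (the wrapper genre.hf_model.GENRE and the checkpoint path follow the repository's examples and should be treated as assumptions; the input sentence is only illustrative):

```python
# minimal sketch: same sample() interface, backed by huggingface/transformers
from genre.hf_model import GENRE

# illustrative path to a downloaded and uncompressed huggingface checkpoint
model = GENRE.from_pretrained("models/hf_wikipage_retrieval").eval()

# generates Wikipedia page titles conditioned on the input text
model.sample(sentences=["Einstein was a German physicist."])
```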
Use the link above to download the datasets. Alternatively, use this script to download all of them. These datasets (except the BLINK data) are a pre-processed version of the Phong Le and Ivan Titov (2018) data available here. The BLINK data was taken from here.
- BLINK train (9,000,000 lines, 11GiB)
- BLINK dev (10,000 lines, 13MiB)
- AIDA-YAGO2 train (18,448 lines, 56MiB)
- AIDA-YAGO2 dev (4,791 lines, 15MiB)
- ACE2004 (257 lines, 850KiB)
- AQUAINT (727 lines, 2.0MiB)
- AIDA-YAGO2 (4,485 lines, 14MiB)
- MSNBC (656 lines, 1.9MiB)
- WNED-CWEB (11,154 lines, 38MiB)
- WNED-WIKI (6,821 lines, 19MiB)
- KILT: for these datasets, please follow the download instructions on the KILT repository.
To pre-process a KILT-formatted dataset into source and target files as expected by fairseq,
use
python scripts/convert_kilt_to_fairseq.py $INPUT_FILENAME $OUTPUT_FOLDER
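Conceptually, the script pairs each KILT input with its gold Wikipedia page title, one example per line. A rough, hypothetical sketch of that transformation (field names follow the KILT schema; this is not the actual script):

```python
# hypothetical sketch of the KILT -> fairseq source/target conversion
# for page retrieval: input text as source, gold page title as target
import json

with open("train.jsonl") as f_in, \
        open("train.source", "w") as f_src, \
        open("train.target", "w") as f_tgt:
    for line in f_in:
        example = json.loads(line)
        f_src.write(example["input"].replace("\n", " ") + "\n")
        f_tgt.write(example["output"][0]["provenance"][0]["title"] + "\n")
```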
Then, to tokenize and binarize the data as expected by fairseq,
use
./preprocess_fairseq.sh $DATASET_PATH $MODEL_PATH
Note that this requires the fairseq
source code to be downloaded in the same folder as the genre
repository (see here).
We also release the BPE prefix tree (trie) built from KILT Wikipedia titles (kilt_titles_trie.pkl), based on the 2019/08/01 Wikipedia dump, downloadable in its raw format here. The trie contains ~5M titles and is used to constrain generation to valid entities in all the KILT experiments.
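A minimal sketch of how the trie can be plugged into constrained beam search (the Trie class lives in genre/trie.py; the checkpoint path and the exact pickle layout of kilt_titles_trie.pkl are assumptions, so adapt as needed):

```python
import pickle

from genre.fairseq_model import GENRE
from genre.trie import Trie  # makes the Trie class importable for unpickling

# assumption: the pickle file directly contains a Trie over entity titles
with open("kilt_titles_trie.pkl", "rb") as f:
    trie = pickle.load(f)

model = GENRE.from_pretrained("models/fairseq_wikipage_retrieval").eval()

model.sample(
    sentences=["Einstein was a German physicist."],
    # allow only continuations that stay inside the trie of valid titles
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)
```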
If the module cannot be found, preface the python command with PYTHONPATH=.
GENRE is licensed under the CC-BY-NC 4.0 license. The text of the license can be found here.