The GENRE (Generative ENtity REtrieval) system, as presented in Autoregressive Entity Retrieval, implemented in pytorch.
@article{de2020autoregressive,
title={Autoregressive Entity Retrieval},
author={De Cao, Nicola and Izacard, Gautier and Riedel, Sebastian and Petroni, Fabio},
journal={arXiv preprint arXiv:2010.00904},
year={2020}
}
Please consider citing our work if you use code from this repository.
In a nutshell, GENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on a fine-tuned BART architecture. GENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search to generate only valid identifiers. For Wikipedia page retrieval in open-domain question answering, for example, the model generates the title of the relevant Wikipedia page conditioned on the question.
For end-to-end entity linking, GENRE re-generates the input text annotated with a markup that identifies mention spans and their entity names.
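As an illustration of the markup (the sentence and its annotation below are only an example, not output from the released models), a mention is wrapped in curly brackets and its predicted entity name follows in square brackets:

```
In 1921, { Einstein } [ Albert Einstein ] received the Nobel Prize in Physics.
```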
GENRE achieves state-of-the-art results on multiple datasets.
- python>=3.7
- pytorch>=1.6
- fairseq>=0.10 (for training -- optional for inference)
- transformers>=4.0 (optional for inference)
See the examples on how to use GENRE with both pytorch/fairseq and huggingface/transformers:
Generally, after importing and loading the model, you would generate predictions (in this example for Entity Disambiguation) with a simple call like:
model.sample(
sentences=[
"[START_ENT] Armstrong [END_ENT] was the first man on the Moon."
]
)
[[{'text': 'Neil Armstrong', 'logprob': tensor(-0.1443)},
{'text': 'William Armstrong', 'logprob': tensor(-1.4650)},
{'text': 'Scott Armstrong', 'logprob': tensor(-1.7311)},
{'text': 'Arthur Armstrong', 'logprob': tensor(-1.7356)},
{'text': 'Rob Armstrong', 'logprob': tensor(-1.7426)}]]
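For completeness, here is a minimal sketch of the loading step that precedes the call above, using the fairseq wrapper from this repository (the checkpoint path is illustrative and assumes a downloaded and uncompressed Entity Disambiguation model):

```python
# minimal sketch: load an Entity Disambiguation checkpoint with the
# fairseq wrapper shipped in this repository and run generation
from genre.fairseq_model import GENRE

# illustrative path to a downloaded and uncompressed checkpoint
model = GENRE.from_pretrained("models/fairseq_entity_disambiguation_aidayago").eval()

predictions = model.sample(
    sentences=["[START_ENT] Armstrong [END_ENT] was the first man on the Moon."]
)
print(predictions)  # top beams with their log-probabilities, as shown above
```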
NOTE: we used fairseq for all experiments in the paper. The huggingface/transformers models were obtained with a conversion script similar to this; therefore, results might differ slightly.
Use the link above to download the models in .tar.gz
format and then run tar -zxvf <FILENAME>
to uncompress them. Alternatively, use this script to download all of them.
| Training Dataset | pytorch / fairseq | huggingface / transformers |
| --- | --- | --- |
| WIKIPEDIA | fairseq_e2e_entity_linking_wiki_abs | hf_e2e_entity_linking_wiki_abs |
| WIKIPEDIA + AidaYago2 | fairseq_e2e_entity_linking_aidayago | hf_e2e_entity_linking_aidayago |
| Training Dataset | pytorch / fairseq | huggingface / transformers |
| --- | --- | --- |
| KILT | fairseq_wikipage_retrieval | hf_wikipage_retrieval |
See the examples here on how to load the models and run inference.
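As a sketch of the huggingface/transformers side (the wrapper genre.hf_model.GENRE and the checkpoint path follow the repository's examples and should be treated as assumptions; the input sentence is only illustrative):

```python
# minimal sketch: same sample() interface, backed by huggingface/transformers
from genre.hf_model import GENRE

# illustrative path to a downloaded and uncompressed huggingface checkpoint
model = GENRE.from_pretrained("models/hf_wikipage_retrieval").eval()

# generates Wikipedia page titles conditioned on the input text
model.sample(sentences=["Einstein was a German physicist."])
```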
Use the link above to download the datasets. Alternatively, use this script to download all of them. These datasets (except the BLINK data) are a pre-processed version of the Phong Le and Ivan Titov (2018) data available here. The BLINK data was taken from here.
- BLINK train (9,000,000 lines, 11GiB)
- BLINK dev (10,000 lines, 13MiB)
- AIDA-YAGO2 train (18,448 lines, 56MiB)
- AIDA-YAGO2 dev (4,791 lines, 15MiB)
- ACE2004 (257 lines, 850KiB)
- AQUAINT (727 lines, 2.0MiB)
- AIDA-YAGO2 (4,485 lines, 14MiB)
- MSNBC (656 lines, 1.9MiB)
- WNED-CWEB (11,154 lines, 38MiB)
- WNED-WIKI (6,821 lines, 19MiB)
- KILT: for these datasets, please follow the download instructions on the KILT repository.
To pre-process a KILT-formatted dataset into source and target files as expected by fairseq,
use
python scripts/convert_kilt_to_fairseq.py $INPUT_FILENAME $OUTPUT_FOLDER
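Conceptually, the script pairs each KILT input with its gold Wikipedia page title, one example per line. A rough, hypothetical sketch of that transformation (field names follow the KILT schema; this is not the actual script):

```python
# hypothetical sketch of the KILT -> fairseq source/target conversion
# for page retrieval: input text as source, gold page title as target
import json

with open("train.jsonl") as f_in, \
        open("train.source", "w") as f_src, \
        open("train.target", "w") as f_tgt:
    for line in f_in:
        example = json.loads(line)
        f_src.write(example["input"].replace("\n", " ") + "\n")
        f_tgt.write(example["output"][0]["provenance"][0]["title"] + "\n")
```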
Then, to tokenize and binarize the data as expected by fairseq,
use
./preprocess_fairseq.sh $DATASET_PATH $MODEL_PATH
Note that this requires the fairseq
source code to be downloaded in the same folder as the genre
repository (see here).
We also release the BPE prefix tree (trie) built from KILT Wikipedia titles (kilt_titles_trie.pkl), based on the 2019/08/01 Wikipedia dump, downloadable in its raw format here. The trie contains ~5M titles and is used to constrain generation to valid entities in all the KILT experiments.
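A minimal sketch of how the trie can be plugged into constrained beam search (the Trie class lives in genre/trie.py; the checkpoint path and the exact pickle layout of kilt_titles_trie.pkl are assumptions, so adapt as needed):

```python
import pickle

from genre.fairseq_model import GENRE
from genre.trie import Trie  # makes the Trie class importable for unpickling

# assumption: the pickle file directly contains a Trie over entity titles
with open("kilt_titles_trie.pkl", "rb") as f:
    trie = pickle.load(f)

model = GENRE.from_pretrained("models/fairseq_wikipage_retrieval").eval()

model.sample(
    sentences=["Einstein was a German physicist."],
    # allow only continuations that stay inside the trie of valid titles
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)
```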
If the module cannot be found, preface the python command with PYTHONPATH=.
GENRE is licensed under the CC-BY-NC 4.0 license. The text of the license can be found here.