Repository for the paper "TempEL: Linking Dynamically Evolving and Newly Emerging Entities", accepted to the NeurIPS 2022 Datasets and Benchmarks Track:
@inproceedings{zaporojets2022tempel,
title={TempEL: Linking Dynamically Evolving and Newly Emerging Entities},
author={Zaporojets, Klim and Kaffee, Lucie-Aim{\'e}e and Deleu, Johannes and Demeester, Thomas and Develder, Chris and Augenstein, Isabelle},
year={2022},
booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track}
}
The dataset can be downloaded from this link.
Alternatively, the TempEL dataset can be downloaded with the following wget command:
wget -O tempel_dataset.zip https://cloud.ilabt.imec.be/index.php/s/RinXy8NgqdW58RW/download
Currently, the dataset is available in the following two formats:
- tempel_v1.0_all (78.4 GB): the dataset files in this directory include the complete text of the Wikipedia anchor and target pages for each of the instances.
- tempel_v1.0_only_bert_tokenized (8.6 GB): the dataset files in this directory only include the truncated BERT tokenization used in our baseline model.
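For orientation, the snippet below shows one way to stream instances from the downloaded .jsonl files. The file path is an assumption for illustration only; the actual field names should be inspected on a real file rather than taken from this sketch.

import json

# Hypothetical path into the unzipped dataset; adjust to the actual layout
# obtained after extracting tempel_dataset.zip.
dataset_file = 'tempel_v1.0_only_bert_tokenized/train/2022-01-01T00:00:00.jsonl'

with open(dataset_file, 'r', encoding='utf-8') as f:
    for line in f:
        instance = json.loads(line)
        # Print the schema of the first instance instead of assuming field names.
        print(sorted(instance.keys()))
        break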
Below we describe how to re-generate the TempEL dataset; the same pipeline can be used to create other versions of TempEL with different hyperparameters, such as the time span between the temporal snapshots.
Step 0: download the necessary Wikipedia tables and history logs
This step requires about 400 GB of disk space.
The TempEL creation pipeline begins with the Wikipedia history logs as well as Wikipedia SQL tables with auxiliary information (e.g., tables listing all the redirect pages in Wikipedia), which can be downloaded from our cloud storage using the following command:
./scripts/dataset/download_wiki_history_logs.sh
The cloud console with all the Wikipedia files can also be accessed at the following link: https://cloud.ilabt.imec.be/index.php/s/8KtT3HxDHybsNmt.
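Since this step needs roughly 400 GB, it can be useful to check the free space on the target partition first; a minimal sketch in Python, assuming the dumps are stored under the current working directory:

import shutil

# Location where the Wikipedia dumps will be written (assumed to be the repo root).
target_dir = '.'

free_gb = shutil.disk_usage(target_dir).free / 1024 ** 3
print(f'Free disk space: {free_gb:.1f} GB')
if free_gb < 400:
    print('Warning: Step 0 requires about 400 GB of free disk space.')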
Step 1: snapshot content extraction
The wikitext content is parsed for each of the temporal snapshots. The hyperparameter configuration is located in the experiments/snapshot_extraction/snap_20220515/config/s01_config_content.json file. Execute:
./scripts/dataset/snapshot_content_extraction.sh
The script above will generate the experiments/snapshot_extraction/snap_20220515/output/wikipedia_evolution_content.jsonl file with the cleaned Wikipedia content of each entity in each snapshot.
It will also create a number of statistics files in the experiments/snapshot_extraction/snap_20220515/output/stats/snapshot_stats/ directory that are later used to calculate attributes (such as the mention prior) used during the creation of the TempEL dataset.
These files are divided into three categories:
- page_info_yyyy-01-01T00:00:00.csv: information necessary to connect the Wikipedia page_id to the wikipedia_title and wikidata_qid, with some additional attributes such as the creation date, the last revision date and the content_length of the page for the snapshot of the year yyyy.
- page_link_stats_yyyy-01-01T00:00:00.csv: contains the details of the anchor mentions and the target entities these mentions are linked to (using both Wikipedia title and Wikidata QID identifiers) for the snapshot of the year yyyy.
- title_changes.csv: contains the temporal title changes of Wikipedia pages. This makes it possible to identify mentions linked to the same entity even if its title changes at some point in time.
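As a quick sanity check, these statistics files can be inspected with pandas; the sketch below deliberately prints the CSV headers instead of assuming specific column names:

import pandas as pd

snapshot = '2020-01-01T00:00:00'
stats_dir = 'experiments/snapshot_extraction/snap_20220515/output/stats/snapshot_stats'

page_info = pd.read_csv(f'{stats_dir}/page_info_{snapshot}.csv')
link_stats = pd.read_csv(f'{stats_dir}/page_link_stats_{snapshot}.csv')

print('page_info columns:', list(page_info.columns))
print('page_link_stats columns:', list(link_stats.columns))
print('number of pages in the snapshot:', len(page_info))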
--nr_threads_processor and --nr_threads_reader: these parameters define the number of parallel threads used to process the files in the data/wikipedia_dump/enwiki_20220201/pages_meta_history/ directory. The --nr_threads_reader parameter defines the number of parallel threads that read each of the files in the input directory, while --nr_threads_processor defines the number of parallel threads that process the entity history.
For best performance, the sum of --nr_threads_processor and --nr_threads_reader should not exceed the total number of available CPUs.
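One simple way to respect this constraint is to derive both values from the number of available CPUs; the 1:3 split below is only an illustrative choice:

import os

nr_cpus = os.cpu_count() or 1
# Keep the sum of reader and processor threads at or below the CPU count.
nr_threads_reader = max(1, nr_cpus // 4)
nr_threads_processor = max(1, nr_cpus - nr_threads_reader)
print(f'--nr_threads_reader {nr_threads_reader} --nr_threads_processor {nr_threads_processor}')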
Step 2.a: inverse dictionary and mention-driven statistics generation
In this step, we generate an inverse dictionary index that maps target entities to the Wikipedia pages with mentions linked to those target entities. Additional mention-driven statistics (e.g., mention prior, edit distance between mention and target entity title, etc.) are also generated and serialized to disk as .csv files. The detailed configuration of the hyperparameters used for this step is located in the experiments/snapshot_extraction/snap_20220515/config/s02_alias_table_generator.json file.
The following command generates the above-mentioned inverse dictionary and mention-driven statistics:
./scripts/dataset/detect_mentions.sh
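To illustrate what is meant by the mention prior: for a mention m and an entity e, it is the fraction of times the anchor text m is linked to e among all links with anchor text m. A minimal sketch of this computation on a toy table (the column names are made up for illustration and may differ from the actual CSV schema):

import pandas as pd

# Toy link statistics: how often each anchor text points to each target entity.
links = pd.DataFrame({
    'mention': ['Paris', 'Paris', 'Paris Hilton'],
    'target_qid': ['Q90', 'Q167646', 'Q47899'],
    'count': [950, 50, 120],
})

# Mention prior: per-target count divided by the total count of that mention text.
totals = links.groupby('mention')['count'].transform('sum')
links['mention_prior'] = links['count'] / totals
print(links)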
Step 2.b: detecting redirects in page history
Some Wikipedia pages are redirect pages in one or more of the snapshots. We do not include these pages in the TempEL dataset, since we are interested in pages with actual content and not in redirects pointing to other pages. The following command generates an output file containing the Wikidata QID and Wikipedia title of such redirect pages:
./scripts/dataset/detect_redirects.sh
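The resulting file can then be used to filter redirect pages out of any downstream processing. The file name and column name in the sketch below are assumptions; check the actual output of the script:

import pandas as pd

# Hypothetical output file and column name of the redirect-detection step.
redirects = pd.read_csv('experiments/snapshot_extraction/snap_20220515/output/redirects.csv')
redirect_qids = set(redirects['wikidata_qid'])

def is_redirect(qid):
    # True for entities whose page was a redirect in one or more snapshots.
    return qid in redirect_qids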
Alternative: download files from the cloud
It can take multiple days for the scripts in Step 1 and Step 2 to complete and generate all the files. Alternatively, these files can also be downloaded from our cloud storage by executing:
./scripts/dataset/download_extracted_snapshots.sh
The files can also be accessed via the following link: https://cloud.ilabt.imec.be/index.php/s/Ytt2WDTJH5r3w4z
Step 3: TempEL dataset generation
Finally, the TempEL dataset is generated with the parameters defined in the experiments/dataset_creation/data_20220515/config/s03_dataset_creator.json file. Execute:
./scripts/dataset/generate_dataset.sh
Note 1: set nr_threads close to the number of available CPUs.
Note 2: the selection of the entities and mentions to produce the TempEL dataset is random. Therefore, the produced version of the TempEL dataset will differ from the one used in the paper, which is available at this link.
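After generation, a quick way to inspect the output is to count the instances per file; the glob pattern below is an assumption about where the generated files end up:

import glob

# Assumed output location of the generated dataset files.
pattern = 'experiments/dataset_creation/data_20220515/output/**/*.jsonl'
for path in sorted(glob.glob(pattern, recursive=True)):
    with open(path, 'r', encoding='utf-8') as f:
        nr_instances = sum(1 for _ in f)
    print(path, nr_instances)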
We use the bi-encoder BLINK model (Wu et al., 2020) as a baseline in our paper, together with the faiss library from Facebook Research for fast candidate entity retrieval.
The training was performed on the TempEL dataset, which can be downloaded using the following command:
./scripts/dataset/download_tempel.sh
We train a separate bi-encoder model for each of the 10 temporal snapshots of TempEL. The training is performed on 4 parallel V100 GPUs. The following command will start the training process and expand automatically to all the available GPUs:
./scripts/biencoder/training_script.sh train_20230112_from_cloud
The training hyperparameters in experiments/models/blink/biencoder/train/train_20230112_from_cloud/config/ are divided into:
- Hyperparameters common to all the snapshots: s04_config_train_parent.json.
- Hyperparameters specific to each of the snapshots: s04_config_train_yyyy.json, with yyyy being the year of the snapshot.
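For readers unfamiliar with the bi-encoder setup, the sketch below illustrates the general idea behind BLINK-style scoring: one BERT encoder for the mention in its context, another for the entity description, and a dot product between the two [CLS] vectors as the score. This is a conceptual sketch only (the mention markers and truncation length are illustrative), not the training code of this repository:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
mention_encoder = BertModel.from_pretrained('bert-base-uncased')
entity_encoder = BertModel.from_pretrained('bert-base-uncased')

def encode(encoder, text):
    # Use the [CLS] token representation as the dense encoding.
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state[:, 0]

mention_vec = encode(mention_encoder, 'He moved to [START] Paris [END] in 1998.')
entity_vec = encode(entity_encoder, 'Paris. Paris is the capital and largest city of France.')
score = (mention_vec * entity_vec).sum(-1)  # dot-product similarity
print(score.item())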
The trained models used to report the results in our work can be downloaded from
this link,
or alternatively using the following command, which will copy the models into the experiments/models/blink/biencoder/train/train_bi_20220630/output/models_ep_9_only/ directory:
./scripts/biencoder/download_models.sh
Each of the 10 trained models (one per temporal snapshot of TempEL) from the previous subsection is used to encode the entities from all 10 Wikipedia snapshots in order to compare the temporal drift in performance of the models (see Table 2 of the paper). This results in a total of 100 encoded entity representation tables (2.2 TB).
The following command will start the encoding process, executing the models configured in the hyperparameter files located in the experiments/models/blink/biencoder/encode/20220630/config directory (the command below uses the models in the experiments/models/blink/biencoder/train/train_bi_20220630/output/models_ep_9_only/ directory):
./scripts/biencoder/encoding_script.sh 20220630
The encodings can also be downloaded from this link, or alternatively using the following command, which will copy the encoded entities into the experiments/models/blink/biencoder/encode/20220630/output/faiss/ directory:
./scripts/biencoder/download_entity_representations.sh
Note: The 100 encoded entity representation tables mentioned above are 2.2 TB in size.
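As a rough illustration of how such faiss tables are used for candidate retrieval (a sketch with random vectors, not the repository's actual indexing code):

import faiss
import numpy as np

d = 768   # dimensionality of the bi-encoder encodings
k = 64    # number of candidate entities to retrieve per mention

# Random stand-ins for an entity table and a batch of mention encodings.
entity_encodings = np.random.rand(10000, d).astype('float32')
mention_encodings = np.random.rand(8, d).astype('float32')

index = faiss.IndexFlatIP(d)   # exact inner-product search
index.add(entity_encodings)
scores, candidate_ids = index.search(mention_encodings, k)
print(candidate_ids.shape)     # (8, 64): top-64 entity ids per mention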
During the evaluation step, each of the models (one for each temporal snapshot) is evaluated on the TempEL dataset.
The predictions are saved in JSON files inside the models/blink/biencoder/evaluate/20220630/output/predictions/ directory (configured in the experiments/models/blink/biencoder/evaluate/20220630/config files).
The following script will run the evaluation, taking the entity encodings from the experiments/models/blink/biencoder/encode/20220630/output/faiss/ directory created in the previous step:
./scripts/biencoder/evaluation_script.sh 20220630
We tested the script above on one 32 GB V100 GPU.
The following script calculates the accuracy@64 metric on the model predictions:
./scripts/stats/metrics_script.sh 20220630
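Conceptually, accuracy@64 is the fraction of mentions for which the gold entity appears among the top-64 retrieved candidates. A minimal sketch (the prediction schema below is assumed, not the actual format of the prediction files):

def accuracy_at_k(predictions, k=64):
    # predictions: list of dicts with a gold entity id and a ranked candidate list.
    hits = sum(1 for p in predictions if p['gold_qid'] in p['candidate_qids'][:k])
    return hits / len(predictions)

predictions = [
    {'gold_qid': 'Q90', 'candidate_qids': ['Q90', 'Q142']},
    {'gold_qid': 'Q64', 'candidate_qids': ['Q1055', 'Q2']},
]
print(accuracy_at_k(predictions))  # 0.5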
The script to calculate the statistics reported in the paper will be available shortly. Stay tuned!