Repository for the paper "TempEL: Linking Dynamically Evolving and Newly Emerging Entities", accepted to the NeurIPS 2022 Datasets and Benchmarks Track:
@inproceedings{zaporojets2022tempel,
title={TempEL: Linking Dynamically Evolving and Newly Emerging Entities},
author={Zaporojets, Klim and Kaffee, Lucie-Aim{\'e}e and Deleu, Johannes and Demeester, Thomas and Develder, Chris and Augenstein, Isabelle},
year={2022},
booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track}
}
The dataset can be downloaded from this link.
Alternatively, the TempEL dataset can be downloaded with the following wget command:
wget -O tempel_dataset.zip https://cloud.ilabt.imec.be/index.php/s/RinXy8NgqdW58RW/download
Currently, the dataset is available in the following two formats:
- tempel_v1.0_all (78.4 GB): the dataset files in this directory include the complete text of the Wikipedia anchor and target pages for each of the instances.
- tempel_v1.0_only_bert_tokenized (8.6 GB): the dataset files in this directory only include the truncated BERT tokenization used in our baseline model.
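For orientation, the snippet below shows one way to stream instances from the downloaded .jsonl files. The file path is an assumption for illustration only; the actual field names should be inspected on a real file rather than taken from this sketch.

import json

# Hypothetical path into the unzipped dataset; adjust to the actual layout
# obtained after extracting tempel_dataset.zip.
dataset_file = 'tempel_v1.0_only_bert_tokenized/train/2022-01-01T00:00:00.jsonl'

with open(dataset_file, 'r', encoding='utf-8') as f:
    for line in f:
        instance = json.loads(line)
        # Print the schema of the first instance instead of assuming field names.
        print(sorted(instance.keys()))
        break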
Below we describe how to re-generate the TempEL dataset; the same pipeline can be used to create other versions of TempEL with different hyperparameters, such as the time span between the temporal snapshots.
Step 0: download the necessary Wikipedia tables and history logs
This step requires about 400 GB of disk space.
The TempEL creation pipeline begins with the Wikipedia history logs as well as Wikipedia SQL tables with auxiliary information (e.g., tables listing all the redirect pages in Wikipedia), which can be downloaded from our cloud storage using the following command:
./scripts/dataset/download_wiki_history_logs.sh
The cloud console with all the Wikipedia files can also be accessed at the following link: https://cloud.ilabt.imec.be/index.php/s/8KtT3HxDHybsNmt.
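Since this step needs roughly 400 GB, it can be useful to check the free space on the target partition first; a minimal sketch in Python, assuming the dumps are stored under the current working directory:

import shutil

# Location where the Wikipedia dumps will be written (assumed to be the repo root).
target_dir = '.'

free_gb = shutil.disk_usage(target_dir).free / 1024 ** 3
print(f'Free disk space: {free_gb:.1f} GB')
if free_gb < 400:
    print('Warning: Step 0 requires about 400 GB of free disk space.')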
Step 1: snapshot content extraction
The wikitext content is parsed for each of the temporal snapshots. The hyperparameter configuration is located in the experiments/snapshot_extraction/snap_20220515/config/s01_config_content.json file. Execute:
./scripts/dataset/snapshot_content_extraction.sh
The script above will generate the experiments/snapshot_extraction/snap_20220515/output/wikipedia_evolution_content.jsonl file with the cleaned Wikipedia content of each entity in each snapshot.
It will also create a number of statistics files in the experiments/snapshot_extraction/snap_20220515/output/stats/snapshot_stats/ directory that are later used to calculate attributes (such as the mention prior) used during the creation of the TempEL dataset.
These files are divided into three categories:
- page_info_yyyy-01-01T00:00:00.csv: information necessary to connect the Wikipedia page_id to the wikipedia_title and wikidata_qid, with some additional attributes such as the creation date, the last revision date and the content_length of the page for the snapshot of the year yyyy.
- page_link_stats_yyyy-01-01T00:00:00.csv: contains the details of the anchor mentions and the target entities these mentions are linked to (using both Wikipedia title and Wikidata QID identifiers) for the snapshot of the year yyyy.
- title_changes.csv: contains the temporal title changes of Wikipedia pages. This makes it possible to identify mentions linked to the same entity even if its title changes at some point in time.
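As a quick sanity check, these statistics files can be inspected with pandas; the sketch below deliberately prints the CSV headers instead of assuming specific column names:

import pandas as pd

snapshot = '2020-01-01T00:00:00'
stats_dir = 'experiments/snapshot_extraction/snap_20220515/output/stats/snapshot_stats'

page_info = pd.read_csv(f'{stats_dir}/page_info_{snapshot}.csv')
link_stats = pd.read_csv(f'{stats_dir}/page_link_stats_{snapshot}.csv')

print('page_info columns:', list(page_info.columns))
print('page_link_stats columns:', list(link_stats.columns))
print('number of pages in the snapshot:', len(page_info))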
--nr_threads_processor and --nr_threads_reader: these parameters define the number of parallel threads used to process the files in the data/wikipedia_dump/enwiki_20220201/pages_meta_history/ directory. The --nr_threads_reader parameter defines the number of parallel threads that read each of the files in the input directory, while --nr_threads_processor defines the number of parallel threads that process the entity history.
For best performance, the sum of --nr_threads_processor and --nr_threads_reader should not exceed the total number of available CPUs.
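One simple way to respect this constraint is to derive both values from the number of available CPUs; the 1:3 split below is only an illustrative choice:

import os

nr_cpus = os.cpu_count() or 1
# Keep the sum of reader and processor threads at or below the CPU count.
nr_threads_reader = max(1, nr_cpus // 4)
nr_threads_processor = max(1, nr_cpus - nr_threads_reader)
print(f'--nr_threads_reader {nr_threads_reader} --nr_threads_processor {nr_threads_processor}')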
Step 2.a: inverse dictionary and mention-driven statistics generation
In this step, we generate an inverse dictionary index that maps target entities to the Wikipedia pages with mentions linked to those target entities. Additional mention-driven statistics (e.g., mention prior, edit distance between mention and target entity title, etc.) are also generated and serialized to disk as .csv files. The detailed configuration of the hyperparameters used for this step is located in the experiments/snapshot_extraction/snap_20220515/config/s02_alias_table_generator.json file.
The following command generates the above-mentioned inverse dictionary and mention-driven statistics:
./scripts/dataset/detect_mentions.sh
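To illustrate what is meant by the mention prior: for a mention m and an entity e, it is the fraction of times the anchor text m is linked to e among all links with anchor text m. A minimal sketch of this computation on a toy table (the column names are made up for illustration and may differ from the actual CSV schema):

import pandas as pd

# Toy link statistics: how often each anchor text points to each target entity.
links = pd.DataFrame({
    'mention': ['Paris', 'Paris', 'Paris Hilton'],
    'target_qid': ['Q90', 'Q167646', 'Q47899'],
    'count': [950, 50, 120],
})

# Mention prior: per-target count divided by the total count of that mention text.
totals = links.groupby('mention')['count'].transform('sum')
links['mention_prior'] = links['count'] / totals
print(links)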
Step 2.b: detecting redirects in page history
Some Wikipedia pages are redirect pages in one or more of the snapshots. We do not include these pages in the TempEL dataset, since we are interested in pages with actual content and not in redirects pointing to other pages. The following command generates an output file containing the Wikidata QID and Wikipedia title of such redirect pages:
./scripts/dataset/detect_redirects.sh
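The resulting file can then be used to filter redirect pages out of any downstream processing. The file name and column name in the sketch below are assumptions; check the actual output of the script:

import pandas as pd

# Hypothetical output file and column name of the redirect-detection step.
redirects = pd.read_csv('experiments/snapshot_extraction/snap_20220515/output/redirects.csv')
redirect_qids = set(redirects['wikidata_qid'])

def is_redirect(qid):
    # True for entities whose page was a redirect in one or more snapshots.
    return qid in redirect_qids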
Alternative: download files from the cloud
It can take multiple days for the scripts in Step 1 and Step 2 to complete and generate all the files. Alternatively, these files can also be downloaded from our cloud storage by executing:
./scripts/dataset/download_extracted_snapshots.sh
The files can also be accessed via the following link: https://cloud.ilabt.imec.be/index.php/s/Ytt2WDTJH5r3w4z
Step 3: TempEL dataset generation
Finally, the TempEL dataset is generated with the parameters defined in the experiments/dataset_creation/data_20220515/config/s03_dataset_creator.json file. Execute:
./scripts/dataset/generate_dataset.sh
Note 1: set nr_threads close to the number of available CPUs.
Note 2: the selection of the entities and mentions to produce the TempEL dataset is random. Therefore, the produced version of the TempEL dataset will differ from the one used in the paper, which is available at this link.
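After generation, a quick way to inspect the output is to count the instances per file; the glob pattern below is an assumption about where the generated files end up:

import glob

# Assumed output location of the generated dataset files.
pattern = 'experiments/dataset_creation/data_20220515/output/**/*.jsonl'
for path in sorted(glob.glob(pattern, recursive=True)):
    with open(path, 'r', encoding='utf-8') as f:
        nr_instances = sum(1 for _ in f)
    print(path, nr_instances)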
We use the bi-encoder BLINK model (Wu et al., 2020) as a baseline in our paper, together with the faiss library from Facebook Research for fast candidate entity retrieval.
The training was performed on the TempEL dataset, which can be downloaded using the following command:
./scripts/dataset/download_tempel.sh
We train a separate bi-encoder model for each of the 10 temporal snapshots of TempEL. The training is performed on 4 parallel V100 GPUs. The following command will start the training process and expand automatically to all the available GPUs:
./scripts/biencoder/training_script.sh train_20230112_from_cloud
The training hyperparameters in experiments/models/blink/biencoder/train/train_20230112_from_cloud/config/ are divided into:
- Hyperparameters common to all the snapshots: s04_config_train_parent.json.
- Hyperparameters specific to each of the snapshots: s04_config_train_yyyy.json, with yyyy being the year of the snapshot.
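For readers unfamiliar with the bi-encoder setup, the sketch below illustrates the general idea behind BLINK-style scoring: one BERT encoder for the mention in its context, another for the entity description, and a dot product between the two [CLS] vectors as the score. This is a conceptual sketch only (the mention markers and truncation length are illustrative), not the training code of this repository:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
mention_encoder = BertModel.from_pretrained('bert-base-uncased')
entity_encoder = BertModel.from_pretrained('bert-base-uncased')

def encode(encoder, text):
    # Use the [CLS] token representation as the dense encoding.
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state[:, 0]

mention_vec = encode(mention_encoder, 'He moved to [START] Paris [END] in 1998.')
entity_vec = encode(entity_encoder, 'Paris. Paris is the capital and largest city of France.')
score = (mention_vec * entity_vec).sum(-1)  # dot-product similarity
print(score.item())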
The trained models used to report the results in our work can be downloaded from
this link,
or alternatively using the following command, which will copy the models into the experiments/models/blink/biencoder/train/train_bi_20220630/output/models_ep_9_only/ directory:
./scripts/biencoder/download_models.sh
Each of the 10 trained models (one per temporal snapshot of TempEL) from the previous subsection is used to encode the entities from all 10 Wikipedia snapshots in order to compare the temporal drift in performance of the models (see Table 2 of the paper). This results in a total of 100 encoded entity representation tables (2.2 TB).
The following command will start the encoding process, executing the models configured in the hyperparameter files located in the experiments/models/blink/biencoder/encode/20220630/config directory (the command below uses the models in the experiments/models/blink/biencoder/train/train_bi_20220630/output/models_ep_9_only/ directory):
./scripts/biencoder/encoding_script.sh 20220630
The encodings can also be downloaded from this link, or alternatively using the following command, which will copy the encoded entities into the experiments/models/blink/biencoder/encode/20220630/output/faiss/ directory:
./scripts/biencoder/download_entity_representations.sh
Note: The 100 encoded entity representation tables mentioned above are 2.2 TB in size.
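As a rough illustration of how such faiss tables are used for candidate retrieval (a sketch with random vectors, not the repository's actual indexing code):

import faiss
import numpy as np

d = 768   # dimensionality of the bi-encoder encodings
k = 64    # number of candidate entities to retrieve per mention

# Random stand-ins for an entity table and a batch of mention encodings.
entity_encodings = np.random.rand(10000, d).astype('float32')
mention_encodings = np.random.rand(8, d).astype('float32')

index = faiss.IndexFlatIP(d)   # exact inner-product search
index.add(entity_encodings)
scores, candidate_ids = index.search(mention_encodings, k)
print(candidate_ids.shape)     # (8, 64): top-64 entity ids per mention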
During the evaluation step, each of the models (one for each temporal snapshot) is evaluated on the TempEL dataset.
The predictions are saved in JSON files inside the models/blink/biencoder/evaluate/20220630/output/predictions/ directory (configured in the experiments/models/blink/biencoder/evaluate/20220630/config files).
The following script will run the evaluation, taking the entity encodings from the experiments/models/blink/biencoder/encode/20220630/output/faiss/ directory created in the previous step:
./scripts/biencoder/evaluation_script.sh 20220630
We tested the script above on one 32 GB V100 GPU.
The following script calculates the accuracy@64 metric on the model predictions:
./scripts/stats/metrics_script.sh 20220630
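Conceptually, accuracy@64 is the fraction of mentions for which the gold entity appears among the top-64 retrieved candidates. A minimal sketch (the prediction schema below is assumed, not the actual format of the prediction files):

def accuracy_at_k(predictions, k=64):
    # predictions: list of dicts with a gold entity id and a ranked candidate list.
    hits = sum(1 for p in predictions if p['gold_qid'] in p['candidate_qids'][:k])
    return hits / len(predictions)

predictions = [
    {'gold_qid': 'Q90', 'candidate_qids': ['Q90', 'Q142']},
    {'gold_qid': 'Q64', 'candidate_qids': ['Q1055', 'Q2']},
]
print(accuracy_at_k(predictions))  # 0.5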
The script to calculate the statistics reported in the paper will be available shortly. Stay tuned!