Additional code to construct the two datasets for the ACMMM'23 paper DRIN: A Dynamic Relation Interactive Network for Multimodal Entity Linking. If you are not interested in how we construct our datasets, you can simply visit https://github.com/starreeze/drin to download the constructed datasets and get our code. If you find this helpful, please cite our paper.
This repo contains scripts for constructing the two datasets, as the original ones do not provide entity images.
The steps taken to construct the datasets:

- We first download from Wikidata a dump file containing all entities (~1 TB in size);
- Extract qid <-> entity name pairs from the huge JSON file downloaded;
- Apply fuzzy search to extract candidate entities for the provided mentions;
- Use the Wikidata API to search for the top-10 images for each candidate entity;
- Clean the images: select the one best-quality image for each entity.
The following are some notes taken during development. We hope they are helpful if you want to construct a similar dataset from scratch.
Create a qid -> entity name mapping, used by both of the following tasks. Read from the JSON files at the path `['entities']['Qxxxx']['sitelinks']['enwiki']['title']`.
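As a minimal sketch of this step (assuming each JSON file follows the `['entities'][qid]['sitelinks']['enwiki']['title']` layout described above; the actual files and error handling in this repo may differ):

```python
def build_qid_to_name(entity_json: dict) -> dict:
    """Map each qid to its English Wikipedia title, following the
    ['entities'][qid]['sitelinks']['enwiki']['title'] path.
    Entities without an enwiki sitelink are skipped."""
    mapping = {}
    for qid, entity in entity_json.get("entities", {}).items():
        title = entity.get("sitelinks", {}).get("enwiki", {}).get("title")
        if title is not None:
            mapping[qid] = title
    return mapping
```

In practice each file would be loaded with `json.load` first and the resulting mappings merged across files.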
Create a mention -> list[qid] mapping:
- `edit_distance(mention name, entity name)`: the same fuzzy search as SOTA;
- `min(edit_distance(mention name, name) for name in search_wikidata_alias(entity name))`
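A minimal sketch of this candidate-ranking idea (the alias dictionary and `top_k` cutoff are illustrative assumptions; a real pipeline over millions of entities would use an indexed fuzzy-search library rather than a full scan):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def rank_candidates(mention: str, qid_to_name: dict, aliases: dict, top_k: int = 10):
    """Score each entity by the minimum edit distance between the mention
    and any of the entity's names (main name plus Wikidata aliases),
    and return the top_k closest qids."""
    scored = []
    for qid, name in qid_to_name.items():
        names = [name] + aliases.get(qid, [])
        score = min(edit_distance(mention.lower(), n.lower()) for n in names)
        scored.append((score, qid))
    scored.sort()
    return [qid for _, qid in scored[:top_k]]
```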
Prepare an input file with the qids (one qid per line), and then run `python spider.py` after specifying the params, or `python spider.py -c` to retry/continue with the previous qids where errors occurred.
The chain qid -> entity name -> image label -> image pageid and revid is not consistent between the Wikidata and Wikipedia APIs, so we just use the qid -> entity name mapping created before.
Wikimedia API. If no image is returned, retry with the aliases of the entity.
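A hedged sketch of the image search via the public MediaWiki API, using the standard `action=query&prop=images` call (the endpoint, the `imlimit` value, and the helper names here are assumptions for illustration; the actual parameters used by spider.py may differ):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def image_query_params(title: str) -> dict:
    """Parameters for listing the files used on a Wikipedia page."""
    return {
        "action": "query",
        "prop": "images",
        "titles": title,
        "imlimit": "10",   # top-10 candidates, as in the pipeline above
        "format": "json",
    }

def search_images(title: str) -> list:
    """Return the image file names used on the page; empty list if none
    (the caller can then retry with an entity alias)."""
    with urlopen(API + "?" + urlencode(image_query_params(title))) as resp:
        data = json.load(resp)
    images = []
    for page in data["query"]["pages"].values():
        images += [img["title"] for img in page.get("images", [])]
    return images
```

To resolve a file name to a downloadable URL, a follow-up `action=query&prop=imageinfo&iiprop=url` request can be made on the `File:` title.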
wikimedia API