Source code and datasets for TOIS 2021 paper: Reinforcement Learning–based Collective Entity Alignment with Adaptive Features.
Initial datasets are from GCN-Align and JAPE.
- The code will be further optimized in early June.
- Python=3.6
- Tensorflow-gpu=1.13.1
- Scipy
- Numpy
- Scikit-learn
- python-Levenshtein
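Before running anything, it can help to confirm that the packages above are importable. The following is a minimal sketch and not part of the repository. The import names in `REQUIRED_MODULES` are assumptions based on how these packages are usually imported (for example, scikit-learn is imported as `sklearn` and python-Levenshtein as `Levenshtein`).

```python
import importlib.util
import sys

# Assumed import names; the install name can differ from the module name
# (scikit-learn -> sklearn, python-Levenshtein -> Levenshtein).
REQUIRED_MODULES = ["tensorflow", "scipy", "numpy", "sklearn", "Levenshtein"]


def find_missing(modules):
    """Return the top-level module names that cannot be located."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    missing = find_missing(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

The check only tests whether each module can be found. It does not verify versions such as Tensorflow-gpu 1.13.1.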
There are nine datasets in this folder:
- zh_en/ja_en/fr_en from the DBP15K dataset
- en_fr/en_de/dbp_wd/dbp_yg from the SRPRS dataset
- DBP-FB dataset
- wd-imdb dataset created by us
Taking the DBP15K (ZH-EN) dataset as an example, the folder "zh_en" contains:
- ent_ids_1: ids for entities in source KG (ZH);
- ent_ids_2: ids for entities in target KG (EN);
- ill_ent_ids: entity links encoded by ids;
- ref_ent_ids: entity links for testing/validation;
- sup_ent_ids: entity links for training;
- triples_1: relation triples encoded by ids in source KG (ZH);
- triples_2: relation triples encoded by ids in target KG (EN);
- zh_vectorList.json: the input entity feature matrix initialized by word vectors;
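The id-encoded files listed above can be read with a few lines of Python. This is a minimal sketch that assumes the tab-separated layout used by the GCN-Align and JAPE releases: `id<TAB>name` rows in ent_ids_*, `e1<TAB>e2` rows in the link files, and `head<TAB>relation<TAB>tail` rows in triples_*. The helper names are hypothetical and do not come from this repository.

```python
import os
import tempfile


def read_rows(path):
    """Read a tab-separated file into a list of string tuples, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t")) for line in f if line.strip()]


def load_entity_ids(path):
    """Map integer entity id -> entity name (assumes 'id<TAB>name' rows)."""
    return {int(row[0]): row[1] for row in read_rows(path)}


def load_int_tuples(path):
    """Load rows of integer ids, e.g. link pairs (e1, e2) or triples (h, r, t)."""
    return [tuple(int(x) for x in row) for row in read_rows(path)]


if __name__ == "__main__":
    # Tiny made-up files, only to show the assumed layout.
    with tempfile.TemporaryDirectory() as folder:
        with open(os.path.join(folder, "ent_ids_1"), "w", encoding="utf-8") as f:
            f.write("0\texample_entity_a\n1\texample_entity_b\n")
        with open(os.path.join(folder, "triples_1"), "w", encoding="utf-8") as f:
            f.write("0\t5\t1\n")
        print(load_entity_ids(os.path.join(folder, "ent_ids_1")))
        print(load_int_tuples(os.path.join(folder, "triples_1")))
```

The same `load_int_tuples` helper would apply to ill_ent_ids, ref_ent_ids and sup_ent_ids if they follow the assumed two-column layout.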
Regarding the semantic information, we obtain the entity name embeddings for DBP15K from RDGCN and those for SRPRS from CEA. During the experiments, we found that for the dbp_wd, dbp_yg and dbp_fb datasets, the pre-trained fastText word embeddings with subword information are more effective, so we use them instead.
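To illustrate what "subword information" refers to, the sketch below extracts character n-grams the way fastText describes them: the word is wrapped in `<` and `>` and every n-gram of length 3 to 6 is collected. This is an illustration only, not the code used to build the embeddings here. The real fastText library additionally hashes the n-grams into buckets and sums their vectors, which is what lets it produce a vector for entity names containing words it has never seen.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return fastText-style character n-grams of a word wrapped in '<' and '>'."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


if __name__ == "__main__":
    print(char_ngrams("where", 3, 3))
```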
We also provide the datasets with entity name embeddings on Baidu Netdisk (Code: s6n1). You can download them and extract them into the data/ directory.
- First, generate the string similarity for entities in the test set and validation set in advance, since this step is very time-consuming:
  python stringsim.py --lan dbp_wd
- The dataset can be chosen from zh_en, ja_en, fr_en, en_fr, en_de, dbp_wd, dbp_yg, dbp_fb.
- Then run
  python Ada.py --lan dbp_wd --method braycurtis --mode first --vali notgen
- The metric (method) can be chosen from cosine, braycurtis, cityblock, euclidean.
- Record the values of Total Match and Total Match True, which represent the number of entity pairs detected by the preliminary treatment and the number of correct matches among them.
- Then run
  python RL.py --lan fr_en --method braycurtis --type test
- Averaged correct matches represents the averaged number of correct matches detected by the RL algorithm for the rest of the entities.
- Adding Total Match True and Averaged correct matches, and dividing the sum by the number of test entities, gives the precision.
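Two parts of the steps above can be sketched in plain Python: what the four --method options compute, and the final precision arithmetic. The metric names match those in scipy.spatial.distance, and the functions below follow SciPy's definitions. Whether Ada.py and RL.py call SciPy in exactly this way is an assumption. The `precision` helper and its argument names are hypothetical. The values passed to it should be the ones printed by Ada.py and RL.py.

```python
import math


def cosine(u, v):
    """Cosine distance: 1 - (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def braycurtis(u, v):
    """Bray-Curtis distance: sum|u_i - v_i| / sum|u_i + v_i|."""
    return sum(abs(a - b) for a, b in zip(u, v)) / sum(abs(a + b) for a, b in zip(u, v))


def cityblock(u, v):
    """Manhattan distance: sum|u_i - v_i|."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Euclidean distance: sqrt(sum (u_i - v_i)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Lookup by the same names accepted by --method.
METRICS = {
    "cosine": cosine,
    "braycurtis": braycurtis,
    "cityblock": cityblock,
    "euclidean": euclidean,
}


def precision(total_match_true, averaged_correct_matches, num_test_entities):
    """(Total Match True + Averaged correct matches) / number of test entities."""
    return (total_match_true + averaged_correct_matches) / num_test_entities


if __name__ == "__main__":
    u, v = [1.0, 2.0, 3.0], [2.0, 2.0, 1.0]  # toy vectors, not real embeddings
    for name, fn in METRICS.items():
        print(name, fn(u, v))
```

For all four metrics a smaller value means the two embeddings are closer. The pure-Python versions are only meant to show the definitions. On full embedding matrices a vectorized routine such as scipy.spatial.distance.cdist is far faster.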
Due to the instability of embedding-based methods, it is acceptable that the results fluctuate a little bit when running code repeatedly.
If you have any questions about reproduction, please feel free to email zengweixin13@nudt.edu.cn.
If you use this model or code, please cite it as follows:
- Weixin Zeng, Xiang Zhao, Jiuyang Tang, Xuemin Lin, and Paul Groth. 2021. Reinforcement Learning–based Collective Entity Alignment with Adaptive Features. ACM Trans. Inf. Syst. 39, 3, Article 26 (May 2021), 31 pages. DOI:https://doi.org/10.1145/3446428
@article{10.1145/3446428,
author = {Zeng, Weixin and Zhao, Xiang and Tang, Jiuyang and Lin, Xuemin and Groth, Paul},
title = {Reinforcement Learning–Based Collective Entity Alignment with Adaptive Features},
year = {2021},
volume = {39},
number = {3},
url = {https://doi.org/10.1145/3446428},
doi = {10.1145/3446428},
journal = {ACM Trans. Inf. Syst.},
}