Implementation of the CIKM-2022 paper "Large-scale Entity Alignment via Knowledge Graph Merging, Partitioning and Embedding".
Entity alignment is a crucial task in knowledge graph fusion. However, most entity alignment approaches scale poorly to large KGs. Recent methods address this issue by dividing large KGs into small blocks and learning embeddings and alignments within each block. However, such a partitioning and learning process loses too much structure and alignment information. Therefore, in this work, we propose a scalable GNN-based entity alignment approach that reduces this loss from three perspectives. First, we propose a centrality-based subgraph generation algorithm to recall landmark entities that serve as bridges between different subgraphs. Second, we introduce self-supervised entity reconstruction to recover entity representations from incomplete neighborhood subgraphs, and design cross-subgraph negative sampling to incorporate entities from other subgraphs in alignment learning. Third, during inference, we merge the embeddings of the subgraphs into a single space for alignment search. Experimental results on the benchmark OpenEA dataset and the proposed large DBpedia1M dataset verify the effectiveness of our approach.
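The third step above, merging per-subgraph embeddings into a single space before the alignment search, can be sketched roughly as follows. This is a minimal illustration and not the code in this repository: the function names `merge_embeddings` and `align`, the choice to average the vectors of an entity that appears in more than one subgraph, and the brute-force cosine search are all assumptions made for the example. At the scale of DBpedia1M a brute-force search like this would be far too slow.

```python
import math


def merge_embeddings(blocks):
    """Merge per-subgraph embedding tables into one table.

    `blocks` is a list of dicts mapping entity id -> vector.
    An entity seen in several subgraphs gets the mean of its vectors
    (averaging is an assumption of this sketch).
    """
    sums, counts = {}, {}
    for block in blocks:
        for entity, vec in block.items():
            if entity not in sums:
                sums[entity] = list(vec)
                counts[entity] = 1
            else:
                sums[entity] = [a + b for a, b in zip(sums[entity], vec)]
                counts[entity] += 1
    return {e: [x / counts[e] for x in v] for e, v in sums.items()}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align(source, target):
    """Map each source entity to its most similar target entity."""
    return {
        s: max(target, key=lambda t: cosine(s_vec, target[t]))
        for s, s_vec in source.items()
    }


# Toy data: "a1" appears in two source subgraphs, as a landmark entity might.
source = merge_embeddings([{"a1": [1.0, 0.0]}, {"a2": [0.0, 1.0], "a1": [1.0, 0.2]}])
target = merge_embeddings([{"b1": [0.9, 0.1]}, {"b2": [0.1, 0.9]}])
matches = align(source, target)
```

Searching in the merged space lets an entity be matched to a counterpart that was assigned to a different subgraph, which a per-block search cannot do.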
We build our model with Python and TensorFlow.
You can test our implementation by running main_ea.py or the .sh files. Please modify the paths in the configuration files in the config folder to match your setup.
Because we use Dual-AMN as our base model, its requirements are a good starting point for setting up ours:
- Python 3.x (tested on Python 3.8)
- matplotlib==3.1.1
- networkx_metis==1.0
- scipy==1.3.1
- tqdm==4.60.0
- tensorflow==2.0.0
- networkx==2.3
- Keras==2.4.3
- dgl==0.5.3
- numpy==1.16.2
- torch==1.5.0
- torch_sparse==0.6.3
- xgboost==1.0.1
- torch_scatter==2.0.4
- faiss==1.5.3
- pynvml==11.4.1
- scikit_learn==1.0.2
- torch_geometric==2.0.3
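
To see which of the pinned versions above differ from what is installed, a small check along these lines may help. It is a sketch rather than part of this repository: `check_pins` and `PINS` are hypothetical names, only a few of the pins are copied in, and the names in the list may not match the installed distribution names exactly (for example, faiss is commonly distributed as `faiss-cpu` or `faiss-gpu`), so adjust the keys as needed.

```python
from importlib import metadata  # standard library since Python 3.8


def check_pins(pins):
    """Return {name: (wanted, found)} for every pin that is not satisfied.

    `found` is None when the distribution is not installed.
    """
    problems = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


# A subset of the pins listed above; extend as needed.
PINS = {
    "tensorflow": "2.0.0",
    "networkx": "2.3",
    "numpy": "1.16.2",
    "scipy": "1.3.1",
    "tqdm": "4.60.0",
}

if __name__ == "__main__":
    for name, (wanted, found) in check_pins(PINS).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

An empty output means every checked pin matches the installed version.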
We construct a new large-scale dataset for EA. It contains millions of entities and includes a large number of non-matchable entities, which is consistent with real-world KGs. The dataset can be downloaded from figshare.
We drew on the code of the following repositories: OpenEA, Dual-AMN, LargeEA, LIME, ClusterGCN. Thanks for their great contributions!