Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

RELATER

RELATER is a Python package that implements the method described in the publication "Unsupervised Graph-based Entity Resolution for Complex Entities", which is under review.

Entity resolution (ER) is the process of linking records that refer to the same entity across one or more databases when no unique entity identifiers are available. RELATER is an unsupervised graph-based ER framework that addresses the challenges of resolving complex entities. We propose a global method that propagates link decisions by propagating attribute values and constraints, which captures changing attribute values and different relationships; a method for leveraging ambiguity in the ER process; an adaptive method for incorporating relationship structure; and a dynamic refinement step that improves record clusters by removing links that are likely wrong. RELATER can be used to resolve records of both basic and complex entities.

Usage

To run the RELATER framework on bibliographic data, for example the DBLP-ACM data set, run the following command:

python -m er.bib_er dblp-acm1 $t_a $t_b $t_m $gamma $t_n

where $t_a, $t_b, $t_m, $gamma, and $t_n are, respectively, the atomic node threshold, the bootstrapping threshold, the merging threshold, the weight distribution in Equation (3), and the threshold on the minimum number of nodes a cluster must contain to be split by bridges.
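The positional order of the five parameters is easy to get wrong, so a small helper that assembles the command from named values may be useful. The following is a minimal sketch, not part of the package: the `build_command` function is hypothetical, and the numeric values are placeholders chosen only for illustration, not recommended settings.

```python
import shlex

# Placeholder values for illustration only; they are NOT recommended settings.
params = {
    "t_a": 0.9,    # atomic node threshold
    "t_b": 0.95,   # bootstrapping threshold
    "t_m": 0.85,   # merging threshold
    "gamma": 0.6,  # weight distribution in Equation (3)
    "t_n": 15,     # minimum cluster size for splitting by bridges
}

# Order in which er.bib_er expects the parameters, as given in the usage line.
PARAM_ORDER = ["t_a", "t_b", "t_m", "gamma", "t_n"]


def build_command(dataset, values):
    """Return the argument list for running RELATER on a bibliographic data set."""
    args = ["python", "-m", "er.bib_er", dataset]
    args += [str(values[name]) for name in PARAM_ORDER]
    return args


cmd = build_command("dblp-acm1", params)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root to start the run.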

Settings

Temporal Constraints

Based on the domain knowledge described in the paper, we set the following temporal constraints. Because the constraints are data set specific, they are listed per data set.

IOS and KIL
  • (ri.ρ = Bb) ^ (rj.ρ = Bm) ^ (15 ≤ YearTimeGap(ri, rj) ≤ 55) → ValidMerge(ri, rj)
  • (ri.ρ = Bb) ^ (rj.ρ = Bf) ^ (15 ≥ YearTimeGap(ri, rj)) → ValidMerge(ri, rj)
  • (ri.ρ = Bb) ^ (rj.ρ = Mm) ^ (15 ≥ YearTimeGap(ri, rj)) → ValidMerge(ri, rj)
  • (ri.ρ = Bb) ^ (rj.ρ = Dd) ^ IsAfter(ri, rj) ^ AlmostSameBirthYears(ri, rj) → ValidMerge(ri, rj)
  • (ri.ρ = Bb) ^ (rj.ρ = Ds) ^ (15 ≥ YearTimeGap(ri, rj)) → ValidMerge(ri, rj)
  • (ri.ρ = Bb) ^ (rj.ρ = Dp) ^ (15 ≥ YearTimeGap(ri, rj)) → ValidMerge(ri, rj)
  • (ri.ρ = Bb) ^ (rj.ρ = Mbp) ^ (30 ≥ YearTimeGap(ri, rj)) → ValidMerge(ri, rj)
  • (ri.ρ = Bb) ^ (rj.ρ = Mgp) ^ (30 ≥ YearTimeGap(ri, rj)) → ValidMerge(ri, rj)
  • (ri.ρ = Bp) ^ (rj.ρ = Bp) ^ (9 ≤ MonthTimeGap(ri, rj)) ^ AlmostSameMarriageYears(ri, rj) → ValidMerge(ri, rj)
  • (ri.ρ = Bp) ^ (rj.ρ = Mm) ^ AlmostSameMarriageYears(ri, rj) → ValidMerge(ri, rj)
  • (ri.ρ = Bm) ^ (rj.ρ = Dd) ^ IsAfter(ri, rj) → ValidMerge(ri, rj)
  • (ri.ρ = Bf) ^ (rj.ρ = Dd) ^ (9 ≤ MonthTimeGap(ri, rj)) → ValidMerge(ri, rj)
  • (ri.ρ = Mm) ^ (rj.ρ = Mm) ^ AlmostSameBirthYears(ri, rj) → ValidMerge(ri, rj)
  • (ri.ρ = Mm) ^ (rj.ρ = Dd) ^ IsAfter(ri, rj) ^ AlmostSameBirthYears(ri, rj) → ValidMerge(ri, rj)
  • (ri.ρ = Mm) ^ (rj.ρ = Mbp) ^ (15 ≥ YearTimeGap(ri, rj)) → ValidMerge(ri, rj)
  • (ri.ρ = Mm) ^ (rj.ρ = Mgp) ^ (15 ≥ YearTimeGap(ri, rj)) → ValidMerge(ri, rj)
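
One way to picture how rules of this form could be checked is a lookup table keyed by the role pair. The sketch below is illustrative only and is not the package's implementation. The `Record` class, its `event_date` field, the definitions of `month_time_gap` and `year_time_gap`, and the choice to treat role pairs without a rule as unconstrained are all assumptions. Only two rules are encoded, and the Bb/Bm rule is read as requiring a gap of between 15 and 55 years.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    role: str         # role label such as "Bb", "Bm", "Bf", or "Dd"
    event_date: date  # date of the recorded event (hypothetical field)


def month_time_gap(ri, rj):
    """Absolute calendar-month difference between two event dates (assumed definition)."""
    months = (rj.event_date.year - ri.event_date.year) * 12
    months += rj.event_date.month - ri.event_date.month
    return abs(months)


def year_time_gap(ri, rj):
    """Gap in whole years, derived from the month gap (assumed definition)."""
    return month_time_gap(ri, rj) // 12


# (role of ri, role of rj) -> predicate that must hold for the merge to be valid.
TEMPORAL_RULES = {
    ("Bb", "Bm"): lambda ri, rj: 15 <= year_time_gap(ri, rj) <= 55,
    ("Bf", "Dd"): lambda ri, rj: 9 <= month_time_gap(ri, rj),
}


def valid_merge(ri, rj):
    """Apply the rule for this role pair; pairs without a rule pass (sketch assumption)."""
    rule = TEMPORAL_RULES.get((ri.role, rj.role))
    return True if rule is None else rule(ri, rj)


baby = Record("Bb", date(1870, 5, 1))
mother = Record("Bm", date(1895, 3, 10))
print(valid_merge(baby, mother))
```

Keeping the rules in a table rather than in nested conditionals makes it straightforward to swap in a different rule set per data set.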

Link Constraints

Based on the domain knowledge described in the paper, we set the following link constraints. Because the constraints are data set specific, they are listed per data set.

IOS and KIL
  • (ri.ρ = Bb) ^ (rj.ρ = Dd) ^ (|Links(ri,Dd)| = 0) ^ (|Links(rj,Bb)| = 0) → ValidMerge(ri, rj)
  • (ri.ρ = Bb) ^ (rj.ρ = Bp) ^ (|Links(rj,Bb)| = 0) → ValidMerge(ri, rj)
  • (ri.ρ = Bb) ^ (rj.ρ = Mm) ^ (|Links(rj,Bb)| = 0) → ValidMerge(ri, rj)
  • (ri.ρ = Bb) ^ (rj.ρ = Mbp) ^ (|Links(rj,Bb)| = 0) → ValidMerge(ri, rj)
  • (ri.ρ = Bb) ^ (rj.ρ = Mgp) ^ (|Links(rj,Bb)| = 0) → ValidMerge(ri, rj)
  • (ri.ρ = Bb) ^ (rj.ρ = Ds) ^ (|Links(rj,Bb)| = 0) → ValidMerge(ri, rj)
  • (ri.ρ = Bb) ^ (rj.ρ = Dp) ^ (|Links(rj,Bb)| = 0) → ValidMerge(ri, rj)
  • (ri.ρ = Bp) ^ (rj.ρ = Dd) ^ (|Links(ri,Dd)| = 0) → ValidMerge(ri, rj)
  • (ri.ρ = Mm) ^ (rj.ρ = Dd) ^ (|Links(ri,Dd)| = 0) → ValidMerge(ri, rj)
  • (ri.ρ = Mbp) ^ (rj.ρ = Dd) ^ (|Links(ri,Dd)| = 0) → ValidMerge(ri, rj)
  • (ri.ρ = Mgp) ^ (rj.ρ = Dd) ^ (|Links(ri,Dd)| = 0) → ValidMerge(ri, rj)
  • (ri.ρ = Ds) ^ (rj.ρ = Dd) ^ (|Links(ri,Dd)| = 0) → ValidMerge(ri, rj)
  • (ri.ρ = Dp) ^ (rj.ρ = Dd) ^ (|Links(ri,Dd)| = 0) → ValidMerge(ri, rj)
IPUMS
  • (ri.ρ = F) ^ (rj.ρ = F) ^ (|Links(ri,F)| = 0) ^ (|Links(rj,F)| = 0) → ValidMerge(ri, rj)
  • (ri.ρ = M) ^ (rj.ρ = M) ^ (|Links(ri,M)| = 0) ^ (|Links(rj,M)| = 0) → ValidMerge(ri, rj)
  • (ri.ρ = C) ^ (rj.ρ = C) ^ (|Links(ri,C)| = 0) ^ (|Links(rj,C)| = 0) → ValidMerge(ri, rj)
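
Each link constraint above requires that a record has no existing link to a record with a given role, which can be checked against the links accepted so far. The following is a minimal sketch under stated assumptions, not the package's code: the `LinkStore` class and its methods are hypothetical, `Links(r, role)` is interpreted as the set of records with that role already linked to `r`, and role pairs without a rule are treated as unconstrained. Only a few of the listed rules are encoded.

```python
from collections import defaultdict

# (role of ri, role of rj) -> checks; each check names which record ("i" or "j")
# must have no existing link to a record with the given role.
LINK_RULES = {
    ("Bb", "Dd"): [("i", "Dd"), ("j", "Bb")],
    ("Bb", "Bp"): [("j", "Bb")],
    ("Mm", "Dd"): [("i", "Dd")],
    ("F", "F"): [("i", "F"), ("j", "F")],
}


class LinkStore:
    """Tracks record roles and accepted links (hypothetical helper)."""

    def __init__(self):
        self.roles = {}                 # record id -> role label
        self._links = defaultdict(set)  # record id -> ids of linked records

    def add_record(self, rec_id, role):
        self.roles[rec_id] = role

    def link(self, a, b):
        self._links[a].add(b)
        self._links[b].add(a)

    def links(self, rec_id, role):
        """Records with the given role that are already linked to rec_id."""
        linked = self._links.get(rec_id, set())
        return {other for other in linked if self.roles[other] == role}

    def valid_merge(self, ri, rj):
        """True if every zero-link check for this role pair is satisfied."""
        checks = LINK_RULES.get((self.roles[ri], self.roles[rj]), [])
        for which, role in checks:
            rec = ri if which == "i" else rj
            if self.links(rec, role):
                return False
        return True


store = LinkStore()
store.add_record("b1", "Bb")
store.add_record("d1", "Dd")
store.add_record("d2", "Dd")
first = store.valid_merge("b1", "d1")   # no links yet
store.link("b1", "d1")
second = store.valid_merge("b1", "d2")  # b1 already has a Dd link
print(first, second)
```

In this sketch, once a birth record is linked to one death record, any further Bb/Dd merge involving it is rejected, which mirrors the intent that a person has at most one death record.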

Package structure

Directory   Contents
common/     Utility functions
data/       Methods to retrieve and preprocess data
er/         ER algorithms proposed in RELATER
febrl/      Methods from Febrl to calculate similarities

Dependencies

The RELATER package requires the following Python packages to be installed:

Contact

Contact the author of the package: nishadi.kirielle@anu.edu.au