RELATER is a Python package that implements the method described in
the publication **Unsupervised Graph-based Entity Resolution for
Complex Entities**, which is under review.
Entity Resolution (ER) is the process of linking
records of the same entity across one or more
databases in the absence of unique entity
identifiers. RELATER is an unsupervised graph-based
entity resolution framework that addresses
the challenges of resolving complex entities.
We propose a global method that propagates link decisions by
propagating attribute values and constraints, which
captures changing attribute values and different
relationships; a method for leveraging ambiguity in
the ER process; an adaptive method of incorporating
relationship structure; and a dynamic refinement step
that improves record clusters by removing likely wrong
links. RELATER can be employed to resolve records of
both basic and complex entities.
To run the RELATER framework on bibliographic data, for
example the DBLP-ACM data set, run the following command:
python -m er.bib_er dblp-acm1 $t_a $t_b $t_m $gamma $t_n
where $t_a, $t_b, $t_m, $gamma, and $t_n are, respectively, the atomic node threshold, the bootstrapping threshold, the merging threshold, the weight distribution in Equation (3) of the paper, and the minimum number of nodes a cluster must contain before it is split by bridges.
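For scripted or repeated runs, the same command can be assembled from Python. The following is a minimal sketch and not part of RELATER: the helper name `build_relater_command` and every threshold value are hypothetical placeholders. Only the positional argument order is taken from the command shown above.

```python
import sys


def build_relater_command(dataset, t_a, t_b, t_m, gamma, t_n):
    """Return the argument list for `python -m er.bib_er` (hypothetical helper)."""
    # Positional order follows the command line documented above:
    # data set, atomic node threshold, bootstrapping threshold,
    # merging threshold, gamma, minimum cluster size for bridge splitting.
    return [sys.executable, "-m", "er.bib_er", dataset,
            str(t_a), str(t_b), str(t_m), str(gamma), str(t_n)]


if __name__ == "__main__":
    # Placeholder values for illustration only; they are not recommended settings.
    cmd = build_relater_command("dblp-acm1", 0.9, 0.95, 0.85, 0.6, 15)
    print(" ".join(cmd))
    # To launch the run from the repository root:
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting issues when it is passed to `subprocess.run`.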
Based on the domain knowledge specified in the paper, we set the following temporal constraints. Because the constraints are data set specific, they are listed per data set.
- (ri.ρ = Bb) ^ (rj.ρ = Bm) ^ (15 ≤ YearTimeGap(ri, rj) ≤ 55) → ValidMerge(ri, rj)
- (ri.ρ = Bb) ^ (rj.ρ = Bf) ^ (15 ≥ YearTimeGap(ri, rj)) → ValidMerge(ri, rj)
- (ri.ρ = Bb) ^ (rj.ρ = Mm) ^ (15 ≥ YearTimeGap(ri, rj)) → ValidMerge(ri, rj)
- (ri.ρ = Bb) ^ (rj.ρ = Dd) ^ IsAfter(ri, rj) ^ AlmostSameBirthYears(ri, rj) → ValidMerge(ri, rj)
- (ri.ρ = Bb) ^ (rj.ρ = Ds) ^ (15 ≥ YearTimeGap(ri, rj)) → ValidMerge(ri, rj)
- (ri.ρ = Bb) ^ (rj.ρ = Dp) ^ (15 ≥ YearTimeGap(ri, rj)) → ValidMerge(ri, rj)
- (ri.ρ = Bb) ^ (rj.ρ = Mbp) ^ (30 ≥ YearTimeGap(ri, rj)) → ValidMerge(ri, rj)
- (ri.ρ = Bb) ^ (rj.ρ = Mgp) ^ (30 ≥ YearTimeGap(ri, rj)) → ValidMerge(ri, rj)
- (ri.ρ = Bp) ^ (rj.ρ = Bp) ^ (9 ≤ MonthTimeGap(ri, rj)) ^ AlmostSameMarriageYears(ri, rj) → ValidMerge(ri, rj)
- (ri.ρ = Bp) ^ (rj.ρ = Mm) ^ AlmostSameMarriageYears(ri, rj) → ValidMerge(ri, rj)
- (ri.ρ = Bm) ^ (rj.ρ = Dd) ^ IsAfter(ri, rj) → ValidMerge(ri, rj)
- (ri.ρ = Bf) ^ (rj.ρ = Dd) ^ (9 ≤ MonthTimeGap(ri, rj)) → ValidMerge(ri, rj)
- (ri.ρ = Mm) ^ (rj.ρ = Mm) ^ AlmostSameBirthYears(ri, rj) → ValidMerge(ri, rj)
- (ri.ρ = Mm) ^ (rj.ρ = Dd) ^ IsAfter(ri, rj) ^ AlmostSameBirthYears(ri, rj) → ValidMerge(ri, rj)
- (ri.ρ = Mm) ^ (rj.ρ = Mbp) ^ (15 ≥ YearTimeGap(ri, rj)) → ValidMerge(ri, rj)
- (ri.ρ = Mm) ^ (rj.ρ = Mgp) ^ (15 ≥ YearTimeGap(ri, rj)) → ValidMerge(ri, rj)
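The temporal rules above share one shape: a pair of record roles selects a predicate over the two records. The following minimal Python sketch illustrates that shape for two of the rules. It is not RELATER's implementation. `Record`, `year_time_gap`, `is_after`, `TEMPORAL_RULES`, and `valid_merge` are hypothetical names. It rests on three assumptions: the gap is the absolute difference in event years, `IsAfter(ri, rj)` means that rj's event is later than ri's, and the Bb/Bm rule means a gap between 15 and 55 years.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    role: str          # role symbol as used in the rules, e.g. "Bb" or "Bm"
    event_date: date   # date of the event the record describes


def year_time_gap(ri, rj):
    """Absolute difference between the two event years (assumed definition)."""
    return abs(ri.event_date.year - rj.event_date.year)


def is_after(ri, rj):
    """True if rj's event is later than ri's (assumed argument order)."""
    return rj.event_date > ri.event_date


# Role pair -> predicate. Only two of the listed rules are encoded here.
TEMPORAL_RULES = {
    ("Bb", "Bm"): lambda ri, rj: 15 <= year_time_gap(ri, rj) <= 55,
    ("Bm", "Dd"): is_after,
}


def valid_merge(ri, rj):
    """Apply the temporal rule for this role pair, if one is defined."""
    rule = TEMPORAL_RULES.get((ri.role, rj.role))
    # In this sketch, role pairs without a rule are left unrestricted.
    return True if rule is None else rule(ri, rj)


if __name__ == "__main__":
    bb = Record("Bb", date(1870, 5, 1))
    print(valid_merge(bb, Record("Bm", date(1895, 3, 2))))  # gap of 25 years
    print(valid_merge(bb, Record("Bm", date(1875, 3, 2))))  # gap of 5 years
```

Keeping the rules in a dictionary keyed by role pair makes it straightforward to swap in a different rule set for each data set.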
Based on the domain knowledge specified in the paper, we set the following link constraints. Because the constraints are data set specific, they are listed per data set.
- (ri.ρ = Bb) ^ (rj.ρ = Dd) ^ (|Links(ri,Dd)| = 0) ^ (|Links(rj,Bb)| = 0) → ValidMerge(ri, rj)
- (ri.ρ = Bb) ^ (rj.ρ = Bp) ^ (|Links(rj,Bb)| = 0) → ValidMerge(ri, rj)
- (ri.ρ = Bb) ^ (rj.ρ = Mm) ^ (|Links(rj,Bb)| = 0) → ValidMerge(ri, rj)
- (ri.ρ = Bb) ^ (rj.ρ = Mbp) ^ (|Links(rj,Bb)| = 0) → ValidMerge(ri, rj)
- (ri.ρ = Bb) ^ (rj.ρ = Mgp) ^ (|Links(rj,Bb)| = 0) → ValidMerge(ri, rj)
- (ri.ρ = Bb) ^ (rj.ρ = Ds) ^ (|Links(rj,Bb)| = 0) → ValidMerge(ri, rj)
- (ri.ρ = Bb) ^ (rj.ρ = Dp) ^ (|Links(rj,Bb)| = 0) → ValidMerge(ri, rj)
- (ri.ρ = Bp) ^ (rj.ρ = Dd) ^ (|Links(ri,Dd)| = 0) → ValidMerge(ri, rj)
- (ri.ρ = Mm) ^ (rj.ρ = Dd) ^ (|Links(ri,Dd)| = 0) → ValidMerge(ri, rj)
- (ri.ρ = Mbp) ^ (rj.ρ = Dd) ^ (|Links(ri,Dd)| = 0) → ValidMerge(ri, rj)
- (ri.ρ = Mgp) ^ (rj.ρ = Dd) ^ (|Links(ri,Dd)| = 0) → ValidMerge(ri, rj)
- (ri.ρ = Ds) ^ (rj.ρ = Dd) ^ (|Links(ri,Dd)| = 0) → ValidMerge(ri, rj)
- (ri.ρ = Dp) ^ (rj.ρ = Dd) ^ (|Links(ri,Dd)| = 0) → ValidMerge(ri, rj)
- (ri.ρ = F) ^ (rj.ρ = F) ^ (|Links(ri,F)| = 0) ^ (|Links(rj,F)| = 0) → ValidMerge(ri, rj)
- (ri.ρ = M) ^ (rj.ρ = M) ^ (|Links(ri,M)| = 0) ^ (|Links(rj,M)| = 0) → ValidMerge(ri, rj)
- (ri.ρ = C) ^ (rj.ρ = C) ^ (|Links(ri,C)| = 0) ^ (|Links(rj,C)| = 0) → ValidMerge(ri, rj)
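Each link constraint above tests whether a record already has a link to some record of a given role, written |Links(r, role)| = 0. The following minimal Python sketch shows one way to keep such counts and to evaluate the first rule (Bb with Dd). It is not RELATER's implementation: the `LinkStore` class, its methods, and the record identifiers are hypothetical.

```python
from collections import defaultdict


class LinkStore:
    """Tracks, per record, which records of each role it is linked to (hypothetical)."""

    def __init__(self):
        # record id -> role of linked records -> set of linked record ids
        self._links = defaultdict(lambda: defaultdict(set))

    def add(self, rid_i, role_i, rid_j, role_j):
        """Register a link between record i (role_i) and record j (role_j)."""
        self._links[rid_i][role_j].add(rid_j)
        self._links[rid_j][role_i].add(rid_i)

    def count(self, rid, role):
        """|Links(r, role)|: number of records with `role` already linked to `rid`."""
        # .get avoids creating empty entries in the nested defaultdicts.
        return len(self._links.get(rid, {}).get(role, ()))


def valid_merge_bb_dd(store, rid_bb, rid_dd):
    """First rule in the list: neither record may already hold a Bb-Dd link."""
    return store.count(rid_bb, "Dd") == 0 and store.count(rid_dd, "Bb") == 0


if __name__ == "__main__":
    store = LinkStore()
    print(valid_merge_bb_dd(store, "b1", "d1"))  # no links registered yet
    store.add("b1", "Bb", "d1", "Dd")
    print(valid_merge_bb_dd(store, "b1", "d2"))  # b1 already has a Dd link
```

The single-sided rules in the list, such as Bb with Mm, would need only one of the two `count` checks.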
Directory | Contents |
---|---|
common/ | Utility functions |
data/ | Methods to retrieve and preprocess data |
er/ | ER algorithms proposed in RELATER |
febrl/ | Methods to calculate similarities, taken from Febrl |
The RELATER package requires the following Python packages to be installed:
Contact the author of the package: nishadi.kirielle@anu.edu.au