Implementation of the core expectation-maximization algorithm proposed in the following paper: Bohannon, P., Dalvi, N., Raghavan, M., & Olteanu, M. (2014). Deduplicating a Places Database. In WWW.
An unsupervised technique that learns distributions over the words that are core to each company name and those that are "background" words. The problem of determining whether two companies are the same is then transformed into computing P(core(c1) == core(c2)), that is, the probability that the core set of words in company c1 is equal to that of company c2.
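To make the quantity P(core(c1) == core(c2)) concrete, here is a toy sketch. It is not the model from the paper or the code in this repository: it assumes each word has an independent, already-learned probability of being a core word, treats every non-empty subset of a name's words as a candidate core, and sums the probability that both names pick the same subset. The function names, the `p_core` dictionary, and the default of 0.5 for unseen words are all hypothetical.

```python
from itertools import combinations


def core_distribution(words, p_core, default=0.5):
    """Return {candidate core set: probability} for one name.

    Toy assumption: each word is core independently with probability
    p_core[word] (hypothetical input), and the core is never empty.
    """
    words = sorted(set(words))
    dist = {}
    for size in range(1, len(words) + 1):
        for subset in combinations(words, size):
            chosen = set(subset)
            prob = 1.0
            for w in words:
                p = p_core.get(w, default)
                prob *= p if w in chosen else 1.0 - p
            dist[frozenset(chosen)] = prob
    total = sum(dist.values())
    if total == 0.0:
        return {}
    # Renormalize because the empty core was excluded.
    return {core: prob / total for core, prob in dist.items()}


def prob_same_core(name1, name2, p_core):
    """Toy estimate of P(core(name1) == core(name2))."""
    d1 = core_distribution(name1.lower().split(), p_core)
    d2 = core_distribution(name2.lower().split(), p_core)
    return sum(p * d2.get(core, 0.0) for core, p in d1.items())


# Hypothetical word-level probabilities, for illustration only.
p_core = {"starbucks": 0.9, "coffee": 0.1}
score = prob_same_core("Starbucks", "Starbucks Coffee", p_core)
print(round(score, 3))
```

Enumerating subsets is exponential in the number of words, so this is only practical for short names; it is meant to show what the probability measures, not how to compute it at scale.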
Some slides describing the algorithm and highlighting some of the results.
- Set up your virtualenv and install the necessary requirements (from `requirements.txt`).
- Run the Starbucks example (from the paper). Modify `em-dedup.py` to point to the sample input file by replacing the line `in_file = 'company_small.csv'` with `in_file = 'starbucks_test.csv'`, then run the script: `python em-dedup.py`. The learned probability distributions will be written to a file called `probs.csv`, where you can inspect the results.
- To run on a custom dataset, create a new input file (similar to `starbucks_test.csv`) with your data, remove any existing pickle files (e.g. `core.pickle`), then run the script. Monitor the log-likelihood value as it is printed at each iteration. If you see large, cyclical swings in this value, the algorithm was unable to converge. You may want to try a smaller dataset first and check that the results are adequate before scaling up.
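The advice on watching the log likelihood can be turned into a small check. The sketch below is not part of `em-dedup.py`; it assumes you have collected the printed per-iteration values into a list yourself, and the function name, tolerance, and window size are hypothetical choices.

```python
def diagnose_log_likelihood(history, tol=1e-4, window=6):
    """Classify a run from its per-iteration log-likelihood values.

    Returns "converged" when the latest change is tiny relative to the
    current value, "oscillating" when the recent changes keep flipping
    sign, and "running" otherwise. Thresholds are illustrative only.
    """
    if len(history) < 2:
        return "running"
    recent = history[-window:]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    if abs(deltas[-1]) <= tol * max(1.0, abs(recent[-1])):
        return "converged"
    # Count direction reversals between consecutive changes.
    sign_flips = sum(1 for d1, d2 in zip(deltas, deltas[1:]) if d1 * d2 < 0)
    if sign_flips >= 2:
        return "oscillating"
    return "running"


# Made-up values, for illustration only.
print(diagnose_log_likelihood([-100.0, -80.0, -70.0, -65.0, -64.9999]))
print(diagnose_log_likelihood([-100.0, -50.0, -100.0, -50.0, -100.0, -50.0]))
```

A result of "oscillating" corresponds to the large, cyclical swings described above, and is a reasonable cue to stop the run and retry on a smaller input.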