A free Python library for accurate and scalable deduplication and entity resolution.
Based on Mikhail Yuryevich Bilenko's Ph.D. dissertation: Learnable Similarity Functions and their Application to Record Linkage and Clustering
Current solutions break easily, don’t scale, and require significant developer time. Our solution is robust, can handle a large volume of data, and can be trained by anyone.
- For an overview and more detail, read the wiki
- Join our Google group for updates
- See our presentation at ChiPy
python setup.py install
python examples/csv_example.py
(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)
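The labeling step in the example can be pictured with a minimal sketch like the one below. It is not dedupe's own code: `label_pairs` and `read_key` are hypothetical names introduced for illustration, and reading 'u' as "unsure" (skip the pair without labeling it) is an assumption. It only shows how 'y', 'n', 'u' and 'f' keystrokes might be turned into labeled training pairs.

```python
from typing import Callable, Iterable, List, Tuple

Record = dict
Pair = Tuple[Record, Record]


def label_pairs(
    pairs: Iterable[Pair],
    read_key: Callable[[str], str] = input,  # hypothetical hook; defaults to console input
) -> Tuple[List[Pair], List[Pair]]:
    """Collect duplicate / distinct labels until the user answers 'f'."""
    duplicates: List[Pair] = []
    distinct: List[Pair] = []
    for left, right in pairs:
        print(left)
        print(right)
        key = ""
        # Keep asking until one of the four recognized keys is entered.
        while key not in ("y", "n", "u", "f"):
            key = read_key("Duplicate? (y)es / (n)o / (u)nsure / (f)inished: ")
            key = key.strip().lower()
        if key == "f":
            break
        if key == "y":
            duplicates.append((left, right))
        elif key == "n":
            distinct.append((left, right))
        # 'u' is assumed to mean "unsure": the pair is left unlabeled.
    return duplicates, distinct
```

Passing a different `read_key` callable, such as one that replays a fixed list of answers, makes the loop usable in scripts without a keyboard.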
Unit tests of core dedupe functions
python test/test_dedupe.py
Test using canonical dataset from Bilenko's research
Using random sample data for training
python test/canonical_test.py
Using active learning for training
python test/canonical_test.py --active True
If something is not behaving intuitively, it is a bug and should be reported. Report it here
- Fork the project.
- Make your feature addition or bug fix.
- Send us a pull request. Bonus points for topic branches.
Copyright (c) 2012 Forest Gregg and Derek Eder of Open City. Released under the MIT License.