approaches_spark

This repository contains Spark-based implementations of BlockSplit and BlockSlicer, two blocking techniques that reduce the search space of entity matching.
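
As a rough illustration of how blocking shrinks the search space, here is a minimal sketch of plain key-based blocking in pure Python. It is not BlockSplit or BlockSlicer and does not use Spark. The `blocking_key` function, the record fields, and the sample data are all hypothetical and chosen only for the example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three characters of the lowercased name.
    return record["name"][:3].lower()


def candidate_pairs(records):
    # Group record ids by blocking key.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    # Only records that share a block are compared with each other.
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Made-up sample records, for illustration only.
records = [
    {"id": 1, "name": "Samsung Galaxy S4"},
    {"id": 2, "name": "samsung galaxy s 4"},
    {"id": 3, "name": "Apple iPhone 5"},
    {"id": 4, "name": "apple iphone5"},
    {"id": 5, "name": "Nokia Lumia 920"},
]

pairs = candidate_pairs(records)
# Number of comparisons without blocking: n * (n - 1) / 2.
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} of {all_pairs} possible comparisons")
```

With this sample data, only the pairs inside the "sam" and "app" blocks are compared. In a distributed setting, blocks of very different sizes can leave some workers with far more comparisons than others, which is the load balancing problem named in the titles of the papers listed below.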

Hadoop MapReduce versions of BlockSplit and BlockSlicer were first proposed or used in the following papers:

  • MESTRE, Demetrio Gomes; PIRES, Carlos Eduardo. Efficient entity matching over multiple data sources with mapreduce. Journal of Information and Data Management, v. 5, n. 1, p. 40-40, 2014.
  • MESTRE, Demetrio Gomes; PIRES, Carlos Eduardo Santos. Improving load balancing for mapreduce-based entity matching. In: 2013 IEEE Symposium on Computers and Communications (ISCC). IEEE, 2013. p. 000618-000624.
  • KOLB, Lars; THOR, Andreas; RAHM, Erhard. Load balancing for mapreduce-based entity resolution. In: 2012 IEEE 28th international conference on data engineering. IEEE, 2012. p. 618-629.

Repository Contributors: