Usage examples of spark-lucenerdd.
Examples with "real-world" datasets are available:
- DBLP vs ACM - DBLP academic articles versus ACM articles
- DBLP vs Scholar - DBLP academic articles versus Google Scholar articles
- Amazon vs Google - Amazon versus Google product listings
- Abt vs Buy - Abt versus Buy product listings
The datasets used for record linkage are available here. A Spark-friendly version of the datasets (in Parquet format) is also available.
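For a flavor of what these linkage examples do, below is a minimal sketch of record linkage with spark-lucenerdd's `LuceneRDD` and its `link` method. The file paths, the `title` field, and the query string built by the linker are illustrative assumptions, not the code shipped with this repository; see the example sources for the real pipelines.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.zouzias.spark.lucenerdd.LuceneRDD
import org.zouzias.spark.lucenerdd._

val spark = SparkSession.builder.appName("linkage-sketch").getOrCreate()

// Hypothetical paths and schema: both datasets are assumed to have a "title" field.
val acm  = spark.read.parquet("data/linkage-articles/acm.parquet")
val dblp = spark.read.parquet("data/linkage-articles/dblp.parquet")

// Index one side of the linkage with Lucene.
val index = LuceneRDD(acm)

// For each DBLP row, build a Lucene phrase query from its title and keep the
// top 3 ACM hits. (A production linker would escape Lucene special characters.)
val linker = (row: Row) => "title:\"" + row.getString(row.fieldIndex("title")) + "\""
val linked = index.link(dblp.rdd, linker, 3)

linked.take(5).foreach(println)
```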
This example loads all countries from a Parquet file containing the fields "name" and "shape" (the shape is mostly polygons in WKT):

```scala
val allCountries = spark.read.parquet("data/spatial/countries-poly.parquet")
```
Then, it loads all capitals from a Parquet file containing the fields "name" and "shape" (the shape is mostly points in WKT):

```scala
val capitals = spark.read.parquet("data/spatial/capitals.parquet")
```
Finally, a ShapeLuceneRDD instance is created on the countries and a linkageByRadius is performed on the capitals; the output is presented in the logs. These steps are sketched below.
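Putting the two loads together, the final step might look roughly like the following. The `ShapeLuceneRDD` constructor usage and the `linkageByRadius` parameter list shown here (query RDD, radius in km, number of results) are assumptions for illustration; consult the actual example source for the exact call.

```scala
import org.zouzias.spark.lucenerdd.spatial.shape.ShapeLuceneRDD
import org.zouzias.spark.lucenerdd._

// Pair each record as (WKT shape, payload); here the payload is the name.
val countryPairs = allCountries.rdd
  .map(row => (row.getString(row.fieldIndex("shape")),
               row.getString(row.fieldIndex("name"))))
val capitalPairs = capitals.rdd
  .map(row => (row.getString(row.fieldIndex("shape")),
               row.getString(row.fieldIndex("name"))))

// Index the country polygons.
val countryShapes = ShapeLuceneRDD(countryPairs)

// Link each capital point to countries within the given radius.
// NOTE: the radius (in km) and topK arguments are assumed for this sketch.
val linked = countryShapes.linkageByRadius(capitalPairs, 50.0, 10)

linked.take(5).foreach(println)
```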
Install Java and SBT, then clone and assemble the project:

```bash
git clone https://github.com/zouzias/spark-lucenerdd-examples.git
cd spark-lucenerdd-examples
sbt compile assembly
```
Download and extract Apache Spark under your home directory, update the `spark-submit.sh` script accordingly, and run `./spark-linkage-*.sh` to run the record linkage examples or `./spark-search-capitals.sh` to run a search example.
Set up Docker and, assuming that you have a docker-machine named `default`, type

```bash
./startZeppelin.sh
```

to start an Apache Zeppelin instance with preloaded notebooks.