Usage examples of spark-lucenerdd.
Examples with "real-world" datasets are available:
- DBLP vs ACM - DBLP academic articles versus ACM articles
- DBLP vs Scholar - DBLP academic articles versus Google Scholar articles
- Amazon vs Google - Amazon versus Google product listings
- Abt vs Buy - Abt versus Buy product listings
The datasets used for record linkage are available here. A Spark-friendly version of the datasets (in Parquet format) is also available.
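For a flavor of what these linkage examples do, below is a minimal sketch of record linkage with spark-lucenerdd's `LuceneRDD` and its `link` method. The file paths, the `title` field, and the query string built by the linker are illustrative assumptions, not the code shipped with this repository; see the example sources for the real pipelines.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.zouzias.spark.lucenerdd.LuceneRDD
import org.zouzias.spark.lucenerdd._

val spark = SparkSession.builder.appName("linkage-sketch").getOrCreate()

// Hypothetical paths and schema: both datasets are assumed to have a "title" field.
val acm  = spark.read.parquet("data/linkage-articles/acm.parquet")
val dblp = spark.read.parquet("data/linkage-articles/dblp.parquet")

// Index one side of the linkage with Lucene.
val index = LuceneRDD(acm)

// For each DBLP row, build a Lucene phrase query from its title and keep the
// top 3 ACM hits. (A production linker would escape Lucene special characters.)
val linker = (row: Row) => "title:\"" + row.getString(row.fieldIndex("title")) + "\""
val linked = index.link(dblp.rdd, linker, 3)

linked.take(5).foreach(println)
```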
This example loads all countries from a Parquet file containing the fields "name" and "shape" (the shape is mostly polygons in WKT):

```scala
val allCountries = spark.read.parquet("data/spatial/countries-poly.parquet")
```
Then, it loads all capitals from a Parquet file containing the fields "name" and "shape" (the shape is mostly points in WKT):

```scala
val capitals = spark.read.parquet("data/spatial/capitals.parquet")
```
Finally, a ShapeLuceneRDD instance is created on the countries and a linkageByRadius is performed on the capitals; the output is presented in the logs. These steps are sketched below.
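Putting the two loads together, the final step might look roughly like the following. The `ShapeLuceneRDD` constructor usage and the `linkageByRadius` parameter list shown here (query RDD, radius in km, number of results) are assumptions for illustration; consult the actual example source for the exact call.

```scala
import org.zouzias.spark.lucenerdd.spatial.shape.ShapeLuceneRDD
import org.zouzias.spark.lucenerdd._

// Pair each record as (WKT shape, payload); here the payload is the name.
val countryPairs = allCountries.rdd
  .map(row => (row.getString(row.fieldIndex("shape")),
               row.getString(row.fieldIndex("name"))))
val capitalPairs = capitals.rdd
  .map(row => (row.getString(row.fieldIndex("shape")),
               row.getString(row.fieldIndex("name"))))

// Index the country polygons.
val countryShapes = ShapeLuceneRDD(countryPairs)

// Link each capital point to countries within the given radius.
// NOTE: the radius (in km) and topK arguments are assumed for this sketch.
val linked = countryShapes.linkageByRadius(capitalPairs, 50.0, 10)

linked.take(5).foreach(println)
```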
Install Java and SBT, then clone and assemble the project:

```bash
git clone https://github.com/zouzias/spark-lucenerdd-examples.git
cd spark-lucenerdd-examples
sbt compile assembly
```
Download and extract Apache Spark under your home directory, update the `spark-submit.sh` script accordingly, and run `./spark-linkage-*.sh` to run the record linkage examples or `./spark-search-capitals.sh` to run a search example.
Set up Docker and, assuming that you have a docker-machine named `default`, type

```bash
./startZeppelin.sh
```

to start an Apache Zeppelin instance with preloaded notebooks.