Primary language: Scala. License: Apache License 2.0.

spark-lucenerdd-examples

Examples of spark-lucenerdd.

Datasets and Entity Linkage

The following pairs of datasets are used to demonstrate the accuracy of the record linkage methods. The goal is to show how easy the spark-lucenerdd library is to use; no tuning or optimization is attempted.

| Dataset | Domain | Attributes | Accuracy (top-1) | References |
|---|---|---|---|---|
| DBLP vs ACM article | Bibliographic | title, authors, venue, year | 0.98 | Benchmark datasets for entity resolution |
| DBLP vs Scholar article | Bibliographic | title, authors, venue, year | 0.953 | Benchmark datasets for entity resolution |
| Amazon vs Google products | E-commerce | name, description, manufacturer, price | 0.58 | Benchmark datasets for entity resolution |
| Abt vs Buy products | E-commerce | name, description, manufacturer, price | 0.64 | Benchmark datasets for entity resolution |

The accuracy reported above is computed by taking the first result of the top-K result list as the linked entity.
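To make the top-1 metric concrete, here is a minimal sketch in plain Scala of how such an accuracy could be computed from a set of linked results and a ground-truth mapping. The names `LinkedRecord` and `top1Accuracy` are hypothetical and are not part of spark-lucenerdd; the examples in this repository may compute the figure differently.

```scala
// Hypothetical helper types for illustration only; not part of spark-lucenerdd.
object Top1Accuracy {

  /** Candidate matches for one query record, best-scoring candidate first. */
  final case class LinkedRecord(queryId: String, candidateIds: Seq[String])

  /**
   * Fraction of query records whose first candidate equals the ground-truth match.
   * Records with no candidates, or with no ground-truth entry, count as misses.
   */
  def top1Accuracy(linked: Seq[LinkedRecord], truth: Map[String, String]): Double =
    if (linked.isEmpty) 0.0
    else {
      val hits = linked.count { r =>
        r.candidateIds.headOption.exists(id => truth.get(r.queryId).contains(id))
      }
      hits.toDouble / linked.size
    }

  def main(args: Array[String]): Unit = {
    val truth = Map("q1" -> "a", "q2" -> "b", "q3" -> "c", "q4" -> "d")
    val linked = Seq(
      LinkedRecord("q1", Seq("a", "x")), // hit
      LinkedRecord("q2", Seq("y", "b")), // correct match is ranked second: a miss
      LinkedRecord("q3", Seq("c")),      // hit
      LinkedRecord("q4", Seq.empty)      // no candidates: a miss
    )
    println(top1Accuracy(linked, truth)) // 2 hits out of 4 records
  }
}
```

Only the head of each candidate list is inspected, so a correct match ranked second or lower still counts as a miss. This is why top-1 accuracy is a stricter measure than checking whether the match appears anywhere in the top-K list.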

All datasets are available in Spark-friendly Parquet format here; the original datasets are available here.

Spatial linkage between countries and capitals

This example loads all countries from a Parquet file containing the fields "name" and "shape" (the shapes are mostly polygons in WKT):

val allCountries = spark.read.parquet("data/spatial/countries-poly.parquet")

Then it loads all capitals from a Parquet file containing the fields "name" and "shape" (the shapes are mostly points in WKT):

val capitals = spark.read.parquet("data/spatial/capitals.parquet")

A ShapeLuceneRDD instance is created on the countries and a linkageByRadius is performed on the capitals. The output is presented in the logs.
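A radius-based linkage needs a coordinate pair for each capital, and the capitals are stored as WKT points. As a rough sketch, the plain-Scala helper below extracts the two coordinates from a WKT `POINT` string. The object name `WktPoint` and its `parse` method are hypothetical and not part of spark-lucenerdd, and the sketch assumes two-dimensional points only.

```scala
import scala.util.Try

// Hypothetical helper for illustration only; not part of spark-lucenerdd.
object WktPoint {

  // Matches 2D points such as "POINT (23.72 37.98)" or "point(23.72 37.98)".
  private val Point = """(?i)\s*POINT\s*\(\s*(\S+)\s+(\S+)\s*\)\s*""".r

  /**
   * Returns the (x, y) coordinates of a WKT point, or None when the input is
   * not a 2D POINT or its coordinates are not numeric.
   */
  def parse(wkt: String): Option[(Double, Double)] = wkt match {
    case Point(x, y) => Try((x.toDouble, y.toDouble)).toOption
    case _           => None
  }

  def main(args: Array[String]): Unit = {
    println(parse("POINT (23.72 37.98)"))              // a well-formed point
    println(parse("POLYGON ((0 0, 1 0, 1 1, 0 0))"))   // not a point: None
  }
}
```

Returning an `Option` rather than a default coordinate keeps malformed rows visible to the caller, who can then choose to drop them or log them. Note that WKT writes x before y, which for geographic data usually means longitude before latitude.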

Development

Usage (spark-submit)

Install Java and SBT, then clone and build the project:

git clone https://github.com/zouzias/spark-lucenerdd-examples.git
cd spark-lucenerdd-examples
sbt compile assembly

Download and extract Apache Spark under your home directory, update the spark-submit.sh script accordingly, and run

./spark-linkage-*.sh

to run the record linkage examples and ./spark-search-capitalts.sh to run a search example.

Usage (docker)

Set up Docker and, assuming that you have a Docker machine named default, type

./startZeppelin.sh

to start an Apache Zeppelin instance with preloaded notebooks.