spark-lucenerdd

Spark RDD with Apache Lucene's query capabilities.

The main abstractions are special types of RDD called LuceneRDD, FacetedLuceneRDD and ShapeLuceneRDD, which instantiate a Lucene index on each Spark executor. These RDDs distribute search queries and aggregate search results between the Spark driver and its executors. Currently, the following queries are supported:

Operation	Syntax	Description
Term Query	`LuceneRDD.termQuery(field, query, topK)`	Exact term search
Fuzzy Query	`LuceneRDD.fuzzyQuery(field, query, maxEdits, topK)`	Fuzzy term search
Phrase Query	`LuceneRDD.phraseQuery(field, query, topK)`	Phrase search
Prefix Query	`LuceneRDD.prefixSearch(field, prefix, topK)`	Prefix search
Query Parser	`LuceneRDD.query(queryString, topK)`	Query parser search
Faceted Search	`FacetedLuceneRDD.facetQuery(queryString, field, topK)`	Faceted Search
Record Linkage	`LuceneRDD.link(otherEntity: RDD[T], linkageFct: T => searchQuery, topK)`	Record linkage via Lucene queries
Circle Search	`ShapeLuceneRDD.circleSearch((x,y), radius, topK)`	Search within radius
Bbox Search	`ShapeLuceneRDD.bboxSearch(lowerLeft, upperLeft, topK)`	Bounding box
Spatial Linkage	`ShapeLuceneRDD.linkByRadius(RDD[T], linkage: T => (x,y), radius, topK)`	Spatial radius linkage

Using the query parser, you can perform prefix queries, fuzzy queries, prefix queries, etc and any combination of those. For more information on using Lucene's query parser, see Query Parser.

Examples

Here are a few examples using LuceneRDD for full text search, spatial search and record linkage. All examples exploit Lucene's flexible query language. For spatial search, lucene-spatial and jts are required.

For more, check the wiki. More examples are available at examples and performance evaluation examples on AWS can be found here.

Presentations

For an overview of the library, check these ScalaIO 2016 Slides.

Linking

You can link against this library (for Spark 1.4+) in your program at the following coordinates:

Using SBT:

libraryDependencies += "org.zouzias" %% "spark-lucenerdd" % "0.3.7"

Using Maven:

<dependency>
    <groupId>org.zouzias</groupId>
    <artifactId>spark-lucenerdd_2.11</artifactId>
    <version>0.3.7</version>
</dependency>

This library can also be added to Spark jobs launched through spark-shell or spark-submit by using the --packages command line option. For example, to include it when starting the spark shell:

$ bin/spark-shell --packages org.zouzias:spark-lucenerdd_2.11:0.3.7

Unlike using --jars, using --packages ensures that this library and its dependencies will be added to the classpath. The --packages argument can also be used with bin/spark-submit.

Compatibility

The project has the following compatibility with Apache Spark:

Artifact	Release Date	Spark compatibility	Notes	Status
0.3.8-SNAPSHOT		>= 2.4.2, JVM 8	develop	Under Development
0.3.7	2019-04-26	>= 2.4.2, JVM 8	tag v.0.3.7	Released
0.3.6	2019-03-11	>= 2.4.0, JVM 8	tag v0.3.6	Released
0.2.8	2017-05-30	2.1.x, JVM 7	tag v0.2.8	Released
0.1.0	2016-09-26	1.4.x, 1.5.x, 1.6.x	tag v0.1.0	Cross-released with 2.10/2.11

Project Status and Limitations

Implicit conversions for the primitive types (Int, Float, Double, Long, String) are supported. Moreover, implicit conversions for all product types (i.e., tuples and case classes) of the above primitives are supported. Implicits for tuples default the field names to "_1", "_2", "_3, ... following Scala's naming conventions for tuples. In addition, implicits for most Spark DataFrame types are supported (MapType and boolean are missing).

Custom Case Classes

If you want to use your own custom class with LuceneRDD you can do it provided that your class member types are one of the primitive types (Int, Float, Double, Long, String).

For more details, see LuceneRDDCustomcaseClassImplicits under the tests directory.

Development

Docker

A docker compose script is setup with some preliminary notebook in Zeppelin, run

docker-compose up

For more LuceneRDD examples on Zeppelin, check these examples

Build from Source

Install Java, SBT and clone the project

git clone https://github.com/zouzias/spark-lucenerdd.git
cd spark-lucenerdd
sbt compile assembly

The above will create an assembly jar containing spark-lucenerdd functionality under target/scala-*/spark-lucenerdd-assembly-*.jar

devjyotipatra/spark-lucenerdd