Spark Vector Connector

A library to integrate Vector with Spark, allowing you to load Spark DataFrames/RDDs into Vector in parallel and to consume the results of Vector-based computations in Spark(SQL). This connector works with both Vector SMP and VectorH MPP.

API documentation

Spark-Vector Connector Scaladocs.

Requirements

This library requires:

Vector(H) 5.0
Spark 1.5.x

Building (from source)

Spark-Vector connector is built with sbt. To build, run:

sbt assembly

Using with Spark shell/submit

This module can be added to Spark using the --jars command line option. Spark shell example (assuming $SPARK_VECTOR is the root directory of spark-vector):

spark-shell --jars $SPARK_VECTOR/target/spark_vector-assembly-1.1-SNAPSHOT.jar

The examples below assume that there is a Vector installation on node vectorhost, with instance VI and database databasename.

SparkSQL

sqlContext.sql("""CREATE TEMPORARY TABLE vector_table
USING com.actian.spark_vector.sql.DefaultSource
OPTIONS (
    host "vectorhost",
    instance "VI",
    database "databasename",
    table "vector_table"
)""")

and then to load data into Vector:

sqlContext.sql("insert into vector_table select * from spark_table")

... or to read Vector data in:

sqlContext.sql("select * from vector_table")

Options

The OPTIONS clause of the SparkSQL statement can contain:

Parameter	Required	Default	Notes
`host`	Yes	none	Name of the host where Vector is located
`instance`	Yes	none	Vector database instance identifier (two letters)
`database`	Yes	none	Vector database name
`user`	No	empty string	User name to use when connecting to Vector
`password`	No	empty string	Password to use when connecting to Vector
`table`	Yes	none	Vector target table
`loadpreSQL*`	No	none	Query to execute before a load, in the same transaction. Multiple queries can be specified using different suffixes, e.g. loadpreSQL0, loadpreSQL1, etc. In this case, the queries are executed in the lexicographic order of their option names.
`loadpostSQL*`	No	none	Query to execute after a load, in the same transaction. Multiple queries can be specified using different suffixes, e.g. loadpostSQL0, loadpostSQL1, etc. In this case, the queries are executed in the lexicographic order of their option names.
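
Lexicographic ordering can be surprising once numeric suffixes have different lengths. The sketch below is only an illustration: it uses `sort` in the C locale to show plain string ordering, and it assumes the connector compares the option names as ordinary strings. The helper name `order_options` and the option names passed to it are made up for this example.

```shell
#!/bin/sh
# Illustration only: order option names the way a plain string
# comparison would, independent of the numeric value of the suffix.
order_options() {
    # LC_ALL=C forces byte-wise comparison regardless of the user's locale.
    printf '%s\n' "$@" | LC_ALL=C sort
}

# loadpreSQL10 sorts before loadpreSQL2 because '1' < '2'.
order_options loadpreSQL2 loadpreSQL10 loadpreSQL0
```

If more than ten queries are needed, zero-padded suffixes (loadpreSQL01, loadpreSQL02, ..., loadpreSQL10) keep the lexicographic order and the numeric order identical.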

Spark-Vector Loader

The Spark-Vector loader is a command line client utility that loads CSV, Parquet, and ORC files through Spark into Vector, using the Spark-Vector connector.

Building

sbt loader/assembly

API documentation

Loader scaladocs

Usage: CSV

Loading CSV files:

spark-submit --class com.actian.spark_vector.loader.Main $SPARK_VECTOR/loader/target/spark_vector_loader-assembly-1.1-SNAPSHOT.jar load csv -sf hdfs://namenode:port/tmp/file.csv \
-vh vectorhost -vi VI -vd databasename -tt vector_table -sc " "

Usage: Parquet

Loading Parquet files:

spark-submit --class com.actian.spark_vector.loader.Main $SPARK_VECTOR/loader/target/spark_vector_loader-assembly-1.1-SNAPSHOT.jar load parquet -sf hdfs://namenode:port/tmp/file.parquet \
-vh vectorhost -vi VI -vd databasename -tt vector_table

Usage: ORC

Loading ORC files:

spark-submit --class com.actian.spark_vector.loader.Main $SPARK_VECTOR/loader/target/spark_vector_loader-assembly-1.1-SNAPSHOT.jar load orc -sf hdfs://namenode:port/tmp/file.orc \
-vh vectorhost -vi VI -vd databasename -tt vector_table
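
The three invocations above differ only in the format name and the source file, so a small wrapper can cut down on repetition. The following is a minimal sketch, not part of the loader: the function name `loader_cmd` and the environment variables `VECTOR_HOST`, `VECTOR_INSTANCE` and `VECTOR_DB` are hypothetical names introduced here, while the flags are the ones used in the commands above. It only prints the command so that it can be reviewed before being run.

```shell
#!/bin/sh
# Sketch: print (not run) the loader command for one source file.
# VECTOR_HOST, VECTOR_INSTANCE and VECTOR_DB are hypothetical variables
# for this example; they default to the placeholder values used above.
loader_cmd() {
    format=$1        # csv, parquet or orc
    source_file=$2   # e.g. hdfs://namenode:port/tmp/file.parquet
    target_table=$3  # Vector target table
    jar="${SPARK_VECTOR:-.}/loader/target/spark_vector_loader-assembly-1.1-SNAPSHOT.jar"
    printf '%s\n' "spark-submit --class com.actian.spark_vector.loader.Main $jar load $format -sf $source_file -vh ${VECTOR_HOST:-vectorhost} -vi ${VECTOR_INSTANCE:-VI} -vd ${VECTOR_DB:-databasename} -tt $target_table"
}

loader_cmd parquet hdfs://namenode:port/tmp/file.parquet vector_table
```

For CSV input, format-specific flags such as `-sc` from the CSV example still have to be appended to the printed command by hand.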

List of options

The entire list of options is available here or can be retrieved with:

spark-submit --class com.actian.spark_vector.loader.Main $SPARK_VECTOR/loader/target/spark_vector_loader-assembly-1.1-SNAPSHOT.jar load --help

Spark-Vector provider

The Spark-Vector provider is a Spark application that serves Vector requests for external data sources.

Building

sbt provider/assembly

API docs

Provider scaladoc

Unit testing

sbt '; set javaOptions ++= "-Dvector.host=vectorhost -Dvector.instance=VI -Dvector.database=databasename -Dvector.user= -Dvector.password=".split(" ").toSeq; test'

Spark-Vector Loader

sbt loader/test

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
