A library to integrate Vector with Spark, allowing you to load Spark DataFrames/RDDs into Vector in parallel and to consume the results of Vector-based computations in Spark (SQL). This connector works with both Vector SMP and VectorH MPP.
Spark-Vector Connector Scaladocs.
This library requires:
- Vector(H) 5.0
- Spark 1.5.x
The Spark-Vector connector is built with sbt. To build, run:
sbt assembly
This module can be added to Spark using the --jars command line option. Spark shell example (assuming $SPARK_VECTOR is the root directory of spark-vector):
spark-shell --jars $SPARK_VECTOR/target/spark_vector-assembly-1.1-SNAPSHOT.jar
Assuming that there is a Vector installation on node vectorhost, instance VI and database databasename, register the Vector table in Spark SQL:
sqlContext.sql("""CREATE TEMPORARY TABLE vector_table
USING com.actian.spark_vector.sql.DefaultSource
OPTIONS (
host "vectorhost",
instance "VI",
database "databasename",
table "vector_table"
)""")
Then, to load data into Vector:
sqlContext.sql("insert into vector_table select * from spark_table")
Or, to read Vector data in:
sqlContext.sql("select * from vector_table")
The OPTIONS clause of the SparkSQL statement can contain:
Parameter | Required | Default | Notes |
---|---|---|---|
host | Yes | none | Host name of the node where Vector is located |
instance | Yes | none | Vector database instance identifier (two letters) |
database | Yes | none | Vector database name |
user | No | empty string | User name to use when connecting to Vector |
password | No | empty string | Password to use when connecting to Vector |
table | Yes | none | Vector target table |
loadpreSQL* | No | none | Query to execute before a load, in the same transaction. Multiple queries can be specified using different suffixes, e.g. loadpreSQL0, loadpreSQL1, etc. In this case, the queries are executed in the lexicographic order of their option names. |
loadpostSQL* | No | none | Query to execute after a load, in the same transaction. Multiple queries can be specified using different suffixes, e.g. loadpostSQL0, loadpostSQL1, etc. In this case, the queries are executed in the lexicographic order of their option names. |
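Because suffixed queries run in lexicographic rather than numeric order, a suffix such as 10 sorts before 2. The sketch below uses a plain byte-wise sort as a stand-in for the connector's ordering (an assumption; the connector's exact comparison is not spelled out here), and the helper name sort_option_names is hypothetical. It suggests that zero-padded suffixes are the safer choice once there are ten or more queries.

```shell
# Hypothetical helper: sorts option names the way a byte-wise
# lexicographic comparison would order them.
sort_option_names() {
  # LC_ALL=C forces byte-wise comparison, independent of the user's locale.
  printf '%s\n' "$@" | LC_ALL=C sort
}

# Unpadded suffixes: loadpreSQL10 sorts before loadpreSQL2.
sort_option_names loadpreSQL2 loadpreSQL10 loadpreSQL0

# Zero-padded suffixes keep the intended numeric order.
sort_option_names loadpreSQL02 loadpreSQL10 loadpreSQL00
```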
The Spark-Vector loader is a command line client utility that loads CSV, Parquet and ORC files through Spark into Vector, using the Spark-Vector connector. To build it, run:
sbt loader/assembly
Loading CSV files:
spark-submit --class com.actian.spark_vector.loader.Main $SPARK_VECTOR/loader/target/spark_vector_loader-assembly-1.1-SNAPSHOT.jar load csv -sf hdfs://namenode:port/tmp/file.csv \
-vh vectorhost -vi VI -vd databasename -tt vector_table -sc " "
Loading Parquet files:
spark-submit --class com.actian.spark_vector.loader.Main $SPARK_VECTOR/loader/target/spark_vector_loader-assembly-1.1-SNAPSHOT.jar load parquet -sf hdfs://namenode:port/tmp/file.parquet \
-vh vectorhost -vi VI -vd databasename -tt vector_table
Loading ORC files:
spark-submit --class com.actian.spark_vector.loader.Main $SPARK_VECTOR/loader/target/spark_vector_loader-assembly-1.1-SNAPSHOT.jar load orc -sf hdfs://namenode:port/tmp/file.orc \
-vh vectorhost -vi VI -vd databasename -tt vector_table
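The three invocations above differ only in the file format, the source file and any format-specific flags, so they can be wrapped in a small shell function. The following is a minimal sketch, not part of the loader itself: the function name run_loader and the DRY_RUN switch are hypothetical, and only flags shown in the commands above are used. With DRY_RUN set, it prints the command instead of submitting it.

```shell
# Root of the spark-vector checkout; defaults to the current directory.
SPARK_VECTOR="${SPARK_VECTOR:-.}"
LOADER_JAR="$SPARK_VECTOR/loader/target/spark_vector_loader-assembly-1.1-SNAPSHOT.jar"

# Print commands instead of running them; set DRY_RUN to empty to submit.
DRY_RUN=1

# run_loader FORMAT SOURCE_FILE [EXTRA_LOADER_ARGS...]  (hypothetical helper)
run_loader() {
  format="$1"
  source_file="$2"
  shift 2
  # Rebuild the positional parameters as the full spark-submit command line.
  set -- spark-submit --class com.actian.spark_vector.loader.Main \
    "$LOADER_JAR" load "$format" -sf "$source_file" \
    -vh vectorhost -vi VI -vd databasename -tt vector_table "$@"
  if [ -n "$DRY_RUN" ]; then
    printf '%s\n' "$*"   # joined with spaces; argument quoting is not shown
  else
    "$@"                 # run the command with each argument preserved as-is
  fi
}

run_loader csv hdfs://namenode:port/tmp/file.csv -sc " "
run_loader parquet hdfs://namenode:port/tmp/file.parquet
run_loader orc hdfs://namenode:port/tmp/file.orc
```

Passing the extra arguments through "$@" keeps values such as the single-space separator intact when the command is actually executed.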
The entire list of options is available here or can be retrieved with:
spark-submit --class com.actian.spark_vector.loader.Main $SPARK_VECTOR/loader/target/spark_vector_loader-assembly-1.1-SNAPSHOT.jar load --help
The Spark-Vector provider is a Spark application that serves Vector requests for external data sources. To build it, run:
sbt provider/assembly
sbt '; set javaOptions ++= "-Dvector.host=vectorhost -Dvector.instance=VI -Dvector.database=databasename -Dvector.user= -Dvector.password=".split(" ").toSeq; test'
sbt loader/test
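The test command above passes the Vector connection settings to sbt as -Dvector.* Java options. As a rough sketch, the same command string can be assembled from environment variables so that the settings are not hard-coded. The VECTOR_* variable names are hypothetical, and the defaults simply mirror the placeholder values used above.

```shell
# Hypothetical environment variables; defaults mirror the placeholders above.
VECTOR_HOST="${VECTOR_HOST:-vectorhost}"
VECTOR_INSTANCE="${VECTOR_INSTANCE:-VI}"
VECTOR_DATABASE="${VECTOR_DATABASE:-databasename}"
VECTOR_USER="${VECTOR_USER:-}"
VECTOR_PASSWORD="${VECTOR_PASSWORD:-}"

java_opts="-Dvector.host=$VECTOR_HOST -Dvector.instance=$VECTOR_INSTANCE"
java_opts="$java_opts -Dvector.database=$VECTOR_DATABASE"
java_opts="$java_opts -Dvector.user=$VECTOR_USER -Dvector.password=$VECTOR_PASSWORD"

# sbt splits the options string on single spaces, so the values must not
# contain spaces or double quotes.
sbt_command="; set javaOptions ++= \"$java_opts\".split(\" \").toSeq; test"

# Print the command for inspection; to run the tests: sbt "$sbt_command"
printf '%s\n' "$sbt_command"
```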
Copyright 2016 Actian Corporation.
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0