Netezza Connector for Apache Spark

A data source library for loading data into Apache Spark SQL DataFrames from an IBM® Netezza® database. It uses the Netezza external table mechanism to transfer data efficiently from the Netezza host system to the Apache Spark system.

Binary download:

You can download the Netezza Connector for Apache Spark assembly jars from here:

Apache Spark Version | Release # | Binary Location
1.5.2+               | v0.1.1    | [spark-netezza-assembly-0.1.1.jar](https://github.com/SparkTC/spark-netezza/releases/download/v0.1.1/spark-netezza-assembly-0.1.1.jar)

Usage

Specify Package

The spark-netezza connector is a registered package on the spark-packages site, so it supports the --packages command line option to resolve its Maven dependencies. You can invoke the connector from spark-shell, spark-sql, pyspark, or spark-submit, depending on your scenario. A dependency on the Netezza JDBC driver must be provided separately.

For example, for spark-shell:

spark-shell --packages com.ibm.SparkTC:spark-netezza_2.10:0.1.1 --driver-class-path /path/to/nzjdbc.jar
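
The same options work with the other launchers; for example, a pyspark session could be started like this (the driver path is a placeholder):

pyspark --packages com.ibm.SparkTC:spark-netezza_2.10:0.1.1 --driver-class-path /path/to/nzjdbc.jar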

Data Sources API

You can use the spark-netezza connector via the Apache Spark Data Sources API in Scala, Java, Python, or SQL, as follows:

Scala

import org.apache.spark.sql._

val sc = // existing SparkContext
val sqlContext = new SQLContext(sc)

val opts = Map("url" -> "jdbc:netezza://netezzahost:5480/database",
        "user" -> "username",
        "password" -> "password",
        "dbtable" -> "tablename",
        "numPartitions" -> "4")
val df = sqlContext.read.format("com.ibm.spark.netezza")
  .options(opts).load()
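
The returned DataFrame behaves like any other Spark DataFrame. As a minimal sketch of standard follow-up steps (not connector-specific; the temporary table name is arbitrary):

df.printSchema()
df.registerTempTable("mytable")
sqlContext.sql("SELECT COUNT(*) FROM mytable").show()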

Java

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;

import java.util.HashMap;
import java.util.Map;

JavaSparkContext sc = // existing JavaSparkContext
SQLContext sqlContext = new SQLContext(sc);

Map<String, String> opts = new HashMap<>();
opts.put("url", "jdbc:netezza://netezzahost:5480/database");
opts.put("user", "username");
opts.put("password", "password");
opts.put("dbtable", "tablename");
opts.put("numPartitions", "4");
DataFrame df = sqlContext.read()
             .format("com.ibm.spark.netezza")
             .options(opts)
             .load();

Python

from pyspark.sql import SQLContext

sc = # existing SparkContext
sqlContext = SQLContext(sc)

df = sqlContext.read \
  .format('com.ibm.spark.netezza') \
  .option('url','jdbc:netezza://netezzahost:5480/database') \
  .option('user','username') \
  .option('password','password') \
  .option('dbtable','tablename') \
  .option('numPartitions','4') \
  .load()

SQL

CREATE TABLE my_table
USING com.ibm.spark.netezza
OPTIONS (
  url 'jdbc:netezza://netezzahost:5480/database',
  user 'username',
  password 'password',
  dbtable 'tablename'
);
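
The registered table can then be queried with Spark SQL like any other table; a minimal example, using the table defined above:

SELECT COUNT(*) FROM my_table;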

Configuration Overview

Configuration is passed to the DataFrame reader using option (or options), as shown above. The supported options are:

Name          | Description
url           | The JDBC URL to connect to the database.
user          | Username to connect to the database.
password      | Password to connect to the database.
dbtable       | The table in the database to load.
numPartitions | Number of partitions used to execute the data movement in parallel.
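
As a quick sanity check of how a given numPartitions setting affected a read, you can inspect the loaded DataFrame with standard Spark code (the actual partition count may differ from the requested value depending on the table layout); for example, in Scala:

println(df.rdd.partitions.length)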

Building From Source

Scala 2.10

The spark-netezza build targets Scala 2.10 by default. If a Scala 2.11 artifact is needed, see Scala 2.11 or Version Cross Build below.

Building General Artifacts

To generate the regular binary, run the following in the root directory:

build/sbt package

To generate the assembly jar, run the following in the root directory:

build/sbt assembly

The artifacts will be generated to:

spark-netezza/target/scala-{binary.version}/
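
If you build the assembly jar yourself, you can pass it to Spark directly with --jars instead of resolving the package from spark-packages; for example (paths are illustrative):

spark-shell --jars /path/to/spark-netezza-assembly-0.1.1.jar --driver-class-path /path/to/nzjdbc.jar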

To run the tests, run the following in the root directory:

build/sbt test

Scala 2.11

To build against Scala 2.11, use the '++' option with the desired version number, for example:

build/sbt ++2.11.7 assembly
build/sbt ++2.11.7 test

Version Cross Build

To produce artifacts for both Scala 2.10 and 2.11:

Start SBT:

 build/sbt

Run in the SBT shell:

 + package
 + test

Using sbt-spark-package Plugin

The spark-netezza connector supports the sbt-spark-package plugin. To publish to the local Ivy repository, run:

build/sbt spPublishLocal