Sequence Data Source for Apache Spark

SeqDataSourceV2

The SeqDataSourceV2 package allows reading Hadoop Sequence Files from Spark SQL.
It is compatible only with Spark 2.4.

Features

  • SeqDataSourceV2 detects the key and value types automatically, unlike the RDD API, which requires knowing them in advance.
  • SeqDataSourceV2 is 1.3x faster than the RDD API (see the benchmark in SeqDataSourceV2Benchmark).
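
To make the first point concrete, here is a minimal sketch that reads the same file both ways. It assumes a file named data.seq whose keys are IntWritable and whose values are LongWritable; the file name and those types are illustrative assumptions, not something the package requires.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[1]").getOrCreate()

    // RDD API: the key and value types must be spelled out by the caller.
    // (Assumption: data.seq holds IntWritable keys and LongWritable values.)
    val rdd = spark.sparkContext.sequenceFile[Int, Long]("data.seq")

    // SeqDataSourceV2: no types are given; they are detected from the file.
    val df = spark.read.format("seq").load("data.seq")
    df.printSchema()

With the RDD API, a wrong type parameter only surfaces as an error when the data is actually read, which is the prior knowledge the data source removes.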

Supported types

The following table shows the type mapping and the types supported by this data source.
Some types support the vectorized read optimization (aka Arrow optimization).

Spark Type     Vectorized Read Path   Hadoop Type
LongType       Supported              LongWritable
DoubleType     Supported              DoubleWritable
FloatType      Supported              FloatWritable
IntegerType    Supported              IntWritable
BooleanType    Supported              BooleanWritable
NullType       Not Supported          NullWritable
StringType     Not Supported          BytesWritable
StringType     Not Supported          Text

N.B:

  • The vectorized read path is disabled by default. You can turn it on by setting spark.sql.seq.enableVectorizedReader to true.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder()
      .master("local[1]")
      .config("spark.sql.seq.enableVectorizedReader", "true")
      .getOrCreate()
  • If any column doesn't support the vectorized read path, SeqDataSourceV2 falls back to the normal read path.
    Examples:

    • The schema (key: IntegerType, value: FloatType) supports the vectorized read path.
    • The schema (key: IntegerType, value: StringType) doesn't support the vectorized read path.
  • The number of rows per batch in the vectorized read path can be set with spark.sql.seq.columnarReaderBatchSize.
    The default batch size is 4096 rows.
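
The fallback rule above can be sketched as a small check over a schema. This is an illustration only, not the package's actual implementation: the object VectorizedSupport and the helper supportsVectorizedRead are hypothetical names, and the set of types is copied from the "Supported" rows of the table.

    import org.apache.spark.sql.types._

    // Hypothetical helper, not part of SeqDataSourceV2.
    object VectorizedSupport {
      // Types marked "Supported" for the vectorized read path in the table.
      val vectorizedTypes: Set[DataType] =
        Set(LongType, DoubleType, FloatType, IntegerType, BooleanType)

      // True only when every column in the schema has a vectorized-capable type.
      def supportsVectorizedRead(schema: StructType): Boolean =
        schema.fields.forall(f => vectorizedTypes.contains(f.dataType))
    }

    // The two example schemas from the notes above.
    val intFloat = new StructType().add("key", IntegerType).add("value", FloatType)
    val intString = new StructType().add("key", IntegerType).add("value", StringType)

    println(VectorizedSupport.supportsVectorizedRead(intFloat))
    println(VectorizedSupport.supportsVectorizedRead(intString))

A check like this can be handy before enabling spark.sql.seq.enableVectorizedReader, to know in advance whether a given schema would take the vectorized path or fall back.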

Installation

Option 1: Include the jar in the Spark-Submit

Download the latest release from the packages page and pass it to spark-submit with --jars.

Example with spark-submit:

$ spark-submit --class Main --jars seq-datasource-v2-0.2.0.jar Example-SNAPSHOT.jar

Example with pyspark:

$ pyspark --jars seq-datasource-v2-0.2.0.jar

Option 2: Include the package in the Spark-Submit

You can include the package directly with the --packages parameter. The latest release is listed on Spark Packages.

Example with spark-submit:

$ spark-submit --class Main --packages garawalid:seq-datasource-v2:0.2.0 Example-SNAPSHOT.jar

Option 3: Import the package as a dependency

You can include SeqDataSourceV2 as a Maven dependency. The latest release is listed on the packages page.

Example with Maven:

<dependency>
  <groupId>org.gwalid</groupId>
  <artifactId>seq-datasource-v2</artifactId>
  <version>0.2.0</version>
</dependency>

Usage

SeqDataSourceV2 is compatible with all the Spark APIs. Here are some examples with the Scala and Python APIs.

Scala API

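    import org.apache.spark.sql.SparkSession

    // local[1] runs Spark locally with one worker thread (local[0] is rejected by Spark)
    val spark = SparkSession.builder()
      .master("local[1]")
      .getOrCreate()

    val df = spark.read.format("seq").load("data.seq")
    df.show()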

Python API

    df = spark.read.format("seq").load("data.seq")
    df.printSchema()

Schema
You can pass a schema to the DataFrame reader. A few rules apply to the schema.

  • The field names must be key and/or value.

The name key projects the key field of the seq file. The same goes for value.

  • The field types should match the types in the seq file.
    import org.apache.spark.sql.types.{IntegerType, LongType, StructType}

    val schema = new StructType()
      .add("key", IntegerType, true)
      .add("value", LongType, true)
    val df = spark.read.format("seq").schema(schema).load("path")
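
Since the field names may be key and/or value, a schema can also name just one of them. The sketch below projects only the value column. It assumes the values in the file are LongWritable, and "path" is the same placeholder as in the example above.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{LongType, StructType}

    val spark = SparkSession.builder().master("local[1]").getOrCreate()

    // Schema with a single field named "value"; the key is not projected.
    // (Assumption: the seq file stores its values as LongWritable.)
    val valueOnly = new StructType().add("value", LongType, nullable = true)

    val values = spark.read.format("seq").schema(valueOnly).load("path")
    values.printSchema()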

Contributing

You are welcome to submit pull requests with any changes for this repository at any time. I'll be very glad to see any contributions.