Using Data Source V2 in Spark 2.4.0

Repository goals

The goal of this repository is to provide an example of how to create a data source for Spark. It uses the new DataSourceV2 API and is updated for Spark 2.4.0.

How this repository is organized

The repository contains different packages under com.example.sources. Each such package represents a different (progressive) example.

The code under src/main is what one would add to a project in order to use the source. Usage examples can be found under src/test for each package.
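For illustration, the tests typically load a source by pointing format(...) at the fully qualified name of its DataSourceV2 class. A minimal sketch (the package and class name below are placeholders, not the repository's actual names):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("usage-example")
      .getOrCreate()

    // A V2 source is addressed by the fully qualified name of its
    // DataSourceV2 class (this one is a placeholder).
    val df = spark.read
      .format("com.example.sources.trivial.DefaultSource")
      .load()

    df.show()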

The project is built with sbt and is not aimed at producing a real library; it exists just to run the tests (using ScalaTest). These tests are meant to provide usage examples rather than real tests (and as such they more often than not print the resulting DataFrame instead of making actual assertions).

In addition to the sources packages, there is a commons package (com.example.common) which includes utilities used in multiple examples.

For more information on the commons package, see the package description.

Read path

Example 1: Trivial reader

The purpose of this source is to explain the basics of creating a data source. It creates a source which always reads the same hard-coded, in-memory content.

For a basic overview of data sources and for the simple example itself, see the source description.
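As a rough, illustrative sketch of the moving parts (the class names here are made up for the sketch, not taken from the repository), a trivial read-only source in the Spark 2.4 API is an entry point that declares ReadSupport, a DataSourceReader that declares the schema and plans partitions, and a per-partition reader that yields InternalRows:

    import java.util

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.sources.v2.reader._
    import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
    import org.apache.spark.sql.types._
    import org.apache.spark.unsafe.types.UTF8String

    // Entry point: the class named in format(...).
    class DefaultSource extends DataSourceV2 with ReadSupport {
      override def createReader(options: DataSourceOptions): DataSourceReader =
        new TrivialReader
    }

    // Declares the fixed schema and plans a single input partition.
    class TrivialReader extends DataSourceReader {
      override def readSchema(): StructType =
        StructType(Seq(StructField("value", StringType)))

      override def planInputPartitions(): util.List[InputPartition[InternalRow]] =
        util.Arrays.asList[InputPartition[InternalRow]](new TrivialPartition)
    }

    // One partition holding the hard-coded, in-memory content.
    class TrivialPartition extends InputPartition[InternalRow] {
      override def createPartitionReader(): InputPartitionReader[InternalRow] =
        new TrivialPartitionReader
    }

    // Iterator-style reader: next() advances, get() returns the current row.
    class TrivialPartitionReader extends InputPartitionReader[InternalRow] {
      private val data  = Array("hello", "world")
      private var index = -1

      override def next(): Boolean = { index += 1; index < data.length }
      override def get(): InternalRow = InternalRow(UTF8String.fromString(data(index)))
      override def close(): Unit = ()
    }

Note that the strings are wrapped in UTF8String: readers produce InternalRow, whose internal representations differ from plain Java types (the subject of example 2).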

Example 2: Internal Row exploration

The purpose of this source is to go deeper into the various ways an InternalRow can be created.

See the source description for more information.
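For orientation, a few common ways to construct an InternalRow in Spark 2.4 (a minimal sketch; note that string columns must already hold UTF8String values):

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeProjection}
    import org.apache.spark.sql.types._
    import org.apache.spark.unsafe.types.UTF8String

    // 1. Convenience factory: boxes the values into a GenericInternalRow.
    val r1: InternalRow = InternalRow.fromSeq(Seq(1, UTF8String.fromString("a"), 2.5))

    // 2. Building a GenericInternalRow directly.
    val r2: InternalRow =
      new GenericInternalRow(Array[Any](1, UTF8String.fromString("a"), 2.5))

    // 3. Converting to the binary UnsafeRow format via a projection.
    val schema = StructType(Seq(
      StructField("i", IntegerType),
      StructField("s", StringType),
      StructField("d", DoubleType)))
    val toUnsafe = UnsafeProjection.create(schema)
    val r3: InternalRow = toUnsafe(r1)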

Example 3: Reading from a database with schema and partitions

Provides a more realistic example using a mock database. It includes configuration through options, support for multiple partitions, and a user-supplied schema.

See the source description for more information.
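A sketch of how these pieces might fit together (the names are illustrative, and the mock database is reduced to a range of ids per slice):

    import java.util

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.sources.v2.reader._
    import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
    import org.apache.spark.sql.types._

    class DefaultSource extends DataSourceV2 with ReadSupport {
      private val defaultSchema = StructType(Seq(StructField("id", IntegerType)))

      // Used when the caller does not supply a schema.
      override def createReader(options: DataSourceOptions): DataSourceReader =
        createReader(defaultSchema, options)

      // Since SPARK-24990 this ReadSupport overload receives the schema the
      // user passed via spark.read.schema(...).
      override def createReader(schema: StructType,
                                options: DataSourceOptions): DataSourceReader = {
        // DataSourceOptions is a case-insensitive map of spark.read.option(...) values.
        val numPartitions = options.getInt("numPartitions", 4)
        new PartitionedReader(schema, numPartitions)
      }
    }

    class PartitionedReader(schema: StructType, numPartitions: Int)
        extends DataSourceReader {
      override def readSchema(): StructType = schema

      // One InputPartition per slice of the (mock) database; each instance is
      // serialized to an executor, which calls createPartitionReader() on it.
      override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
        val parts = new util.ArrayList[InputPartition[InternalRow]]()
        (0 until numPartitions).foreach(i => parts.add(new SlicePartition(i)))
        parts
      }
    }

    // Stands in for one slice of the database: here just a range of ids
    // (for brevity the sketch produces single-int rows regardless of schema).
    class SlicePartition(slice: Int) extends InputPartition[InternalRow] {
      override def createPartitionReader(): InputPartitionReader[InternalRow] =
        new InputPartitionReader[InternalRow] {
          private var id  = slice * 10 - 1
          private val end = slice * 10 + 10
          override def next(): Boolean = { id += 1; id < end }
          override def get(): InternalRow = InternalRow(id)
          override def close(): Unit = ()
        }
    }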

Example 4: Predicate pushdown

=============================TODO=============================
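While this example is still to be written, a preview may help: in the Spark 2.4 API, filter pushdown is expressed by mixing SupportsPushDownFilters into the DataSourceReader. A minimal, illustrative sketch (partition planning elided):

    import java.util

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.sources.{Filter, GreaterThan}
    import org.apache.spark.sql.sources.v2.reader._
    import org.apache.spark.sql.types.StructType

    class PushdownReader(schema: StructType)
        extends DataSourceReader with SupportsPushDownFilters {

      private var pushed: Array[Filter] = Array.empty

      // Spark hands over the query's filters; the source keeps those it can
      // evaluate and returns the rest, which Spark re-applies after the scan.
      override def pushFilters(filters: Array[Filter]): Array[Filter] = {
        val (supported, unsupported) = filters.partition {
          case _: GreaterThan => true // pretend the backing store handles '>'
          case _              => false
        }
        pushed = supported
        unsupported
      }

      // Reported back to Spark and visible in the physical plan (explain()).
      override def pushedFilters(): Array[Filter] = pushed

      override def readSchema(): StructType = schema

      // Elided: a real reader would use `pushed` to restrict the scan.
      override def planInputPartitions(): util.List[InputPartition[InternalRow]] =
        util.Collections.emptyList()
    }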

Example 5: Column pruning

=============================TODO=============================
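Also still to be written; as a preview, column pruning in the 2.4 API is the SupportsPushDownRequiredColumns mix-in. An illustrative sketch:

    import java.util

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.sources.v2.reader._
    import org.apache.spark.sql.types.StructType

    class PruningReader(fullSchema: StructType)
        extends DataSourceReader with SupportsPushDownRequiredColumns {

      // Starts as the full schema; Spark narrows it to the referenced columns.
      private var requiredSchema: StructType = fullSchema

      // Spark calls this with only the columns the query actually needs.
      override def pruneColumns(required: StructType): Unit =
        requiredSchema = required

      // Must return the pruned schema: the rows the partition readers
      // produce have to match it column-for-column.
      override def readSchema(): StructType = requiredSchema

      // Elided: readers would fetch only the fields in `requiredSchema`.
      override def planInputPartitions(): util.List[InputPartition[InternalRow]] =
        util.Collections.emptyList()
    }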

Example 6: Handling retries

=============================TODO=============================

Write path

Trivial example

=============================TODO=============================
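Pending the written example, here is a minimal, illustrative sketch of the 2.4 write path: a WriteSupport entry point, a driver-side DataSourceWriter, a serializable factory, and a per-task DataWriter (this "sink" just prints the rows it receives):

    import java.util.Optional

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.sources.v2.writer._
    import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, WriteSupport}
    import org.apache.spark.sql.types.StructType

    class DefaultSource extends DataSourceV2 with WriteSupport {
      override def createWriter(writeUUID: String, schema: StructType,
                                mode: SaveMode,
                                options: DataSourceOptions): Optional[DataSourceWriter] =
        Optional.of(new TrivialWriter)
    }

    // Driver side: creates the per-task factory and receives the final
    // commit()/abort() once all tasks have reported back.
    class TrivialWriter extends DataSourceWriter {
      override def createWriterFactory(): DataWriterFactory[InternalRow] =
        new TrivialWriterFactory
      override def commit(messages: Array[WriterCommitMessage]): Unit = ()
      override def abort(messages: Array[WriterCommitMessage]): Unit = ()
    }

    // Serialized to executors; one DataWriter per task attempt.
    class TrivialWriterFactory extends DataWriterFactory[InternalRow] {
      override def createDataWriter(partitionId: Int, taskId: Long,
                                    epochId: Long): DataWriter[InternalRow] =
        new TrivialDataWriter(partitionId)
    }

    case object TrivialCommit extends WriterCommitMessage

    class TrivialDataWriter(partitionId: Int) extends DataWriter[InternalRow] {
      override def write(record: InternalRow): Unit =
        println(s"partition $partitionId: $record")
      override def commit(): WriterCommitMessage = TrivialCommit
      override def abort(): Unit = ()
    }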

Base example with database mocking

Adding transactional capabilities to the write path

=============================TODO=============================
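Also still to be written. The hook for transactionality in this API is its two-level commit protocol: each task's DataWriter.commit() returns a WriterCommitMessage, and the driver-side DataSourceWriter.commit() runs only after every task has committed, so a source can stage data per task and publish it atomically. An illustrative sketch (the staging mechanism here is hypothetical):

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.sources.v2.writer._

    // Message a task sends to the driver: which (hypothetical) staging
    // area this attempt filled.
    case class StagedCommit(stagingId: String) extends WriterCommitMessage

    class TransactionalDataWriter(stagingId: String) extends DataWriter[InternalRow] {
      // Rows would go to a staging area that readers cannot see yet.
      override def write(record: InternalRow): Unit = ()

      // Task-level commit: seal this attempt's staging area and report it.
      override def commit(): WriterCommitMessage = StagedCommit(stagingId)

      // Task-level abort: discard this attempt's staged rows.
      override def abort(): Unit = ()
    }

    class TransactionalWriter(factory: DataWriterFactory[InternalRow])
        extends DataSourceWriter {
      override def createWriterFactory(): DataWriterFactory[InternalRow] = factory

      // Job-level commit: runs once on the driver, only after every task has
      // committed; publish all staged areas so readers see all rows or none.
      override def commit(messages: Array[WriterCommitMessage]): Unit =
        messages.foreach { case StagedCommit(id) => () /* publish area `id` */ }

      // Job-level abort: clean up whatever any attempt managed to stage.
      override def abort(messages: Array[WriterCommitMessage]): Unit = ()
    }

SPARK-23323 (listed below) guarantees that only one attempt of each task commits, which this protocol relies on.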

Resources

The following resources provide more information:

YouTube deep dives

Online tutorials:

Relevant JIRA issues:

  • SPARK-15689: Original Data source API v2 (spark 2.3.0)
  • Changes for Spark 2.4.0:
    • SPARK-23323: The output commit coordinator is used by default to ensure only one attempt of each task commits.
    • SPARK-23325 and SPARK-24971: Readers should always produce InternalRow instead of Row or UnsafeRow.
    • SPARK-24990: ReadSupportWithSchema was removed; the user-supplied schema overload was added to ReadSupport.
    • SPARK-24073: Read splits are now called InputPartition and a few methods were also renamed for clarity.
    • SPARK-25127: SupportsPushDownCatalystFilters was removed because it leaked Expression in the public API. V2 always uses the Filter API now.
    • SPARK-24478: Pushdown is now done during conversion to the physical plan.
  • Future: