A library for Spark that helps standardize any input data (DataFrame) so that it adheres to the provided schema.

Spark Data Standardization Library

  • Dataframe in
  • Standardized Dataframe out

Usage

Needed Provided Dependencies

The library needs the following dependencies to be included in your project:

"org.apache.spark" %% "spark-core" % SPARK_VERSION,
"org.apache.spark" %% "spark-sql" % SPARK_VERSION,
"za.co.absa" %% s"spark-commons-spark${SPARK_MAJOR}.${SPARK_MINOR}" % "0.6.1",

Usage in SBT:

"za.co.absa" %% "spark-data-standardization" % VERSION 

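The two snippets above can be combined into one build definition. The following is a minimal sketch of a `build.sbt`, not the project's official setup. The Scala version, the `sparkVersion` value, and the `standardizationVersion` placeholder are assumptions chosen for illustration, and marking the first three dependencies as `Provided` is inferred from the "Needed Provided Dependencies" heading.

```scala
// build.sbt (sketch). All concrete version values are illustrative assumptions.
ThisBuild / scalaVersion := "2.12.15"

// Spark 3.2.1 is the version this README lists for Scala 2.12.
val sparkVersion = "3.2.1"
// Evaluates to "3.2"; used to pick the matching spark-commons artifact name.
val sparkMajorMinor = sparkVersion.split('.').take(2).mkString(".")
// Placeholder: replace with an actual released version of the library.
val standardizationVersion = "VERSION"

libraryDependencies ++= Seq(
  // Expected to be supplied by the Spark runtime, hence Provided (assumption).
  "org.apache.spark" %% "spark-core" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-sql"  % sparkVersion % Provided,
  "za.co.absa" %% s"spark-commons-spark$sparkMajorMinor" % "0.6.1" % Provided,
  // The standardization library itself.
  "za.co.absa" %% "spark-data-standardization" % standardizationVersion
)
```

If your deployment does not already ship these libraries on the cluster, drop the `Provided` scope so that they are packaged with your application.
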
Usage in Maven

Scala 2.11

<dependency>
   <groupId>za.co.absa</groupId>
   <artifactId>spark-data-standardization_2.11</artifactId>
   <version>${latest_version}</version>
</dependency>

Scala 2.12

<dependency>
   <groupId>za.co.absa</groupId>
   <artifactId>spark-data-standardization_2.12</artifactId>
   <version>${latest_version}</version>
</dependency>

Scala 2.13

<dependency>
   <groupId>za.co.absa</groupId>
   <artifactId>spark-data-standardization_2.13</artifactId>
   <version>${latest_version}</version>
</dependency>

Spark and Scala compatibility

         Scala 2.11   Scala 2.12   Scala 2.13
Spark    2.4.7        3.2.1        3.2.1
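
For a project that cross-builds against several Scala versions, the Spark version can be derived from the Scala binary version according to the compatibility listed above. This is only a sketch under assumptions: the helper name `sparkVersionFor` and the exact patch versions in `crossScalaVersions` are hypothetical and not taken from this project's build.

```scala
// build.sbt (sketch): pick the Spark version that matches the Scala binary version.
// The patch versions below are illustrative assumptions.
crossScalaVersions := Seq("2.11.12", "2.12.15", "2.13.8")

// Hypothetical helper encoding the Spark and Scala compatibility from this README.
def sparkVersionFor(scalaBinary: String): String = scalaBinary match {
  case "2.11" => "2.4.7"
  case _      => "3.2.1" // Scala 2.12 and 2.13
}

libraryDependencies ++= {
  val sparkVersion = sparkVersionFor(scalaBinaryVersion.value)
  Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion % Provided,
    "org.apache.spark" %% "spark-sql"  % sparkVersion % Provided
  )
}
```

Running a task with the `+` prefix (for example `sbt +compile`) then builds once per entry in `crossScalaVersions`, each with its matching Spark version.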

How to Release

Please see this file for more details.

How to generate a code coverage report

sbt ++<scala.version> jacoco

The code coverage report will be generated at:

{project-root}/target/scala-{scala_version}/jacoco/report/html