Data Quality Testing with Deequ in Spark

This sample repository shows how to use Deequ for data quality testing in Spark. It accompanies the blog post "Data Quality Testing with Deequ in Spark".

Setup

To begin, you need a working Scala development environment. If you don't have one, install Java, Scala and sbt (Scala Build Tool). On Linux x86-64 the installation looks as follows:

# Install Java (on Debian)
sudo apt install default-jre

# Install Coursier (Scala application and artifact manager); `cs setup` also installs sbt
curl -fL https://github.com/coursier/coursier/releases/latest/download/cs-x86_64-pc-linux.gz | gzip -d > cs && chmod +x cs && ./cs setup

# Install the Scala 2.12 runner and compiler
cs install scala:2.12.15 && cs install scalac:2.12.15

Next, download a compatible Apache Spark distribution (version 3.3.x is recommended) and add the bin folder to your system path. If you can run spark-submit, you are all set.

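# Download Spark (if this mirror no longer hosts 3.3.2, older releases are kept at https://archive.apache.org/dist/spark/)
curl -fL https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz --output spark.tgz
tar xvf spark.tgz
# Writing to /usr/local usually requires root
sudo mv spark-3.3.2-bin-hadoop3 /usr/local/spark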

# Add the following line to your .bashrc (adds Spark to PATH)
export PATH="$PATH:/usr/local/spark/bin"
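To confirm the setup, a small check along these lines can report which of the tools mentioned above are on your PATH. This is only a sketch: `check_tool` is a made-up helper name and is not part of the repository.

```shell
# check_tool is a hypothetical helper, not part of the repository.
# It prints whether a command can be found on PATH and returns non-zero if it cannot.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Report on every tool used in this README without stopping at the first gap.
for tool in java scala sbt spark-submit; do
  check_tool "$tool" || true
done
```

If `spark-submit` is reported as missing, open a new shell (or re-source your .bashrc) so the updated PATH takes effect.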

Usage

Starter Script

The repository includes a starter Spark script (EmptyExample) that contains no data quality checks yet: it reads a CSV file and writes it in Parquet format to the output path. It takes the input path, the output path and a path for metric storage as command-line arguments.

Compile and package the script with the following command, which writes the jar to target/scala-2.12/glue-deequ_2.12-0.1.0.jar.

sbt compile && sbt package

Run the job with:

spark-submit \
	--class EmptyExample \
	./target/scala-2.12/glue-deequ_2.12-0.1.0.jar \
	"./data/iowa_liquor_sales_lite/year=2022/iowa_liquor_sales_01.csv" \
	"./outputs/sales/iowa_liquor_sales_processed" \
	"./outputs/dataquality/iowa_liquor_sales_processed"

Example Deequ Checks Script

The Deequ library is already declared in the project's dependencies, but it is deliberately not bundled into the compiled jar. Use the following command to copy it to the target/libs folder, or download it yourself from the Maven repository.

sbt copyRuntimeDependencies
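Because the jar's file name includes the Deequ and Spark versions, it can help to confirm that the copy succeeded and to look up the exact name before submitting the job. The following is a minimal sketch, assuming the jar lands in target/libs with a name starting with `deequ-`; `find_deequ_jar` and `DEEQU_JAR` are hypothetical names, not part of the repository.

```shell
# find_deequ_jar is a hypothetical helper, not part of the repository.
# It prints the first deequ-*.jar in the given folder (default: ./target/libs)
# and returns non-zero if none exists.
find_deequ_jar() {
  libs_dir="${1:-./target/libs}"
  for jar in "$libs_dir"/deequ-*.jar; do
    # An unmatched glob is left as a literal pattern, so test for a real file.
    if [ -f "$jar" ]; then
      echo "$jar"
      return 0
    fi
  done
  echo "no Deequ jar in $libs_dir; run sbt copyRuntimeDependencies first" >&2
  return 1
}

# Keep the path in a variable so it can be reused when submitting the job.
DEEQU_JAR="$(find_deequ_jar)" || true
echo "Deequ jar: ${DEEQU_JAR:-not found}"
```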

Pass the --jars option to spark-submit so that the library is loaded at runtime:

spark-submit \
	--jars ./target/libs/deequ-2.0.3-spark-3.3.jar \
	--class ExampleSpark \
	./target/scala-2.12/glue-deequ_2.12-0.1.0.jar \
	"./data/iowa_liquor_sales_lite/year=2022/iowa_liquor_sales_01.csv" \
	"./outputs/sales/iowa_liquor_sales_processed" \
	"./outputs/dataquality/iowa_liquor_sales_processed"

After the job finishes, the output Parquet files are stored in outputs/sales/iowa_liquor_sales_processed and can be inspected with pandas or with data tools such as Tad.
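For a quick look before opening the files in another tool, a sketch like the following reports whether the job completed and how many Parquet files it wrote. It assumes Spark's default behaviour of writing a `_SUCCESS` marker file into the output folder and naming data files with a `.parquet` suffix; `summarize_output` is a hypothetical helper name, not part of the repository.

```shell
# summarize_output is a hypothetical helper, not part of the repository.
# It reports whether the _SUCCESS marker is present and counts the Parquet files.
summarize_output() {
  out_dir="$1"
  if [ ! -d "$out_dir" ]; then
    echo "no output directory at $out_dir"
    return 1
  fi
  if [ -f "$out_dir/_SUCCESS" ]; then
    echo "status: complete"
  else
    echo "status: no _SUCCESS marker"
  fi
  # Count Parquet files at any depth, in case the output is split into subfolders.
  count="$(find "$out_dir" -type f -name '*.parquet' | wc -l | tr -d ' ')"
  echo "parquet files: $count"
}

summarize_output "./outputs/sales/iowa_liquor_sales_processed" || true
```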