A batch ETL demo built on the AdventureWorks database in PostgreSQL, using Apache Spark's DataFrame and SQL APIs to:
- Extract a subset of the tables from the AdventureWorks database (the full database is available at https://github.com/lorint/AdventureWorks-for-Postgres)
- Persist the extracted tables in Parquet format
- Perform basic column-level cleaning
- Denormalize into a star schema by joining related tables (a rough sketch of these steps follows this list)
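As a rough illustration of the four steps above, here is a minimal sketch. The connection settings, output paths, and the particular tables and cleaning rules are assumptions for illustration only; the actual job selects its own subset of tables.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object EtlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AdventureWorksETL")
      .getOrCreate()

    // Extract: read one table from the AdventureWorks Postgres database over JDBC.
    // URL, user, and password are placeholders; adjust them to your environment.
    def readTable(table: String): DataFrame =
      spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/Adventureworks")
        .option("dbtable", table)
        .option("user", "postgres")
        .option("password", "postgres")
        .option("driver", "org.postgresql.Driver")
        .load()

    // Persist: write each extracted table as Parquet.
    val orders   = readTable("sales.salesorderheader")
    val details  = readTable("sales.salesorderdetail")
    val products = readTable("production.product")
    orders.write.mode("overwrite").parquet("data/raw/salesorderheader.parquet")
    details.write.mode("overwrite").parquet("data/raw/salesorderdetail.parquet")
    products.write.mode("overwrite").parquet("data/raw/product.parquet")

    // Clean: basic column-level cleaning, e.g. trimming strings and
    // dropping rows with null keys.
    val cleanProducts = products
      .withColumn("name", trim(col("name")))
      .na.drop(Seq("productid"))

    // Denormalize: join the order-detail fact table with its dimensions
    // into one wide, star-schema-style table.
    val salesFact = details
      .join(orders, Seq("salesorderid"), "left")
      .join(cleanProducts, Seq("productid"), "left")

    salesFact.write.mode("overwrite").parquet("data/star/sales_fact.parquet")
    spark.stop()
  }
}
```

Reading over JDBC requires the PostgreSQL driver on the classpath; one way to handle this is to bundle it into the assembled jar, as in the build sketch further below.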
To run the project you need sbt (for Scala), spark-submit, and a running AdventureWorks database. Then:
- sbt assembly
- spark-submit --class Main <appJarFolder>/<appJar>
where Main is the entry point of the app, <appJarFolder> is the full path to the target/scala-x.yz folder containing the assembled .jar file, and <appJar> is the name of that .jar. For example:
- spark-submit --class Main /path/to/target/scala-2.12/AdventureWorksETL-assembly-0.1.0-SNAPSHOT.jar
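The `sbt assembly` step assumes the sbt-assembly plugin is configured. A minimal build.sbt sketch might look like the following; the Spark, Scala, and dependency versions here are assumptions and should be aligned with your cluster:

```scala
// build.sbt — minimal sketch; versions are assumptions, match them to your environment.
name := "AdventureWorksETL"
version := "0.1.0-SNAPSHOT"
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // Spark is "provided": spark-submit supplies it at runtime, keeping the fat jar small.
  "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided",
  // PostgreSQL JDBC driver, bundled so the job can read AdventureWorks over JDBC.
  "org.postgresql" % "postgresql" % "42.7.3"
)

// Discard duplicate META-INF entries when sbt-assembly merges dependency jars.
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}
```

This also assumes sbt-assembly is enabled in project/plugins.sbt, e.g. with `addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")` (plugin version is likewise an assumption).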