A batch ETL demo built on the AdventureWorks database in PostgreSQL, using Apache Spark's DataFrame and SQL APIs to:
- Extract a subset of the tables from the AdventureWorks database (the full database is available at https://github.com/lorint/AdventureWorks-for-Postgres)
- Persist the extracted tables in Parquet format
- Perform basic column-level cleaning
- Denormalize into a star schema by joining related tables (a rough sketch of these steps follows this list)
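As a rough illustration of the four steps above, here is a minimal sketch. The connection settings, output paths, and the particular tables and cleaning rules are assumptions for illustration only; the actual job selects its own subset of tables.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object EtlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AdventureWorksETL")
      .getOrCreate()

    // Extract: read one table from the AdventureWorks Postgres database over JDBC.
    // URL, user, and password are placeholders; adjust them to your environment.
    def readTable(table: String): DataFrame =
      spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/Adventureworks")
        .option("dbtable", table)
        .option("user", "postgres")
        .option("password", "postgres")
        .option("driver", "org.postgresql.Driver")
        .load()

    // Persist: write each extracted table as Parquet.
    val orders   = readTable("sales.salesorderheader")
    val details  = readTable("sales.salesorderdetail")
    val products = readTable("production.product")
    orders.write.mode("overwrite").parquet("data/raw/salesorderheader.parquet")
    details.write.mode("overwrite").parquet("data/raw/salesorderdetail.parquet")
    products.write.mode("overwrite").parquet("data/raw/product.parquet")

    // Clean: basic column-level cleaning, e.g. trimming strings and
    // dropping rows with null keys.
    val cleanProducts = products
      .withColumn("name", trim(col("name")))
      .na.drop(Seq("productid"))

    // Denormalize: join the order-detail fact table with its dimensions
    // into one wide, star-schema-style table.
    val salesFact = details
      .join(orders, Seq("salesorderid"), "left")
      .join(cleanProducts, Seq("productid"), "left")

    salesFact.write.mode("overwrite").parquet("data/star/sales_fact.parquet")
    spark.stop()
  }
}
```

Reading over JDBC requires the PostgreSQL driver on the classpath; one way to handle this is to bundle it into the assembled jar, as in the build sketch further below.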
To run the project you need sbt (for Scala), spark-submit, and a running AdventureWorks database. Then:
- sbt assembly
- spark-submit --class Main <appJarFolder>/<appJar>
where Main is the entry point of the app, <appJarFolder> is the full path to the target/scala-x.yz folder containing the assembled .jar file, and <appJar> is the name of that .jar. For example:
- spark-submit --class Main /path/to/target/scala-2.12/AdventureWorksETL-assembly-0.1.0-SNAPSHOT.jar
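The `sbt assembly` step assumes the sbt-assembly plugin is configured. A minimal build.sbt sketch might look like the following; the Spark, Scala, and dependency versions here are assumptions and should be aligned with your cluster:

```scala
// build.sbt — minimal sketch; versions are assumptions, match them to your environment.
name := "AdventureWorksETL"
version := "0.1.0-SNAPSHOT"
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // Spark is "provided": spark-submit supplies it at runtime, keeping the fat jar small.
  "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided",
  // PostgreSQL JDBC driver, bundled so the job can read AdventureWorks over JDBC.
  "org.postgresql" % "postgresql" % "42.7.3"
)

// Discard duplicate META-INF entries when sbt-assembly merges dependency jars.
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}
```

This also assumes sbt-assembly is enabled in project/plugins.sbt, e.g. with `addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")` (plugin version is likewise an assumption).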