spark_poc

An attempt to re-create a data warehouse solution using Spark.

Stage 1: The input file is compared with the existing file (snapshot), and records are updated or inserted according to SCD Type 2 (slowly changing dimension).
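
The Stage 1 comparison can be sketched roughly as follows. This is a minimal illustration in plain Python rather than Spark, so that it runs without a cluster; it is not the code in this repository. The record layout (`id` as the business key, `city` as the tracked attribute, and the `start_date` / `end_date` / `is_current` columns) and the function names are assumptions made for the example.

```python
from datetime import date

# Assumed sentinel end date marking the current version of a record.
OPEN_END = date(9999, 12, 31)


def new_version(rec, load_date):
    """Build a current SCD 2 row from an incoming record (hypothetical layout)."""
    return {
        "id": rec["id"],
        "city": rec["city"],
        "start_date": load_date,
        "end_date": OPEN_END,
        "is_current": True,
    }


def apply_scd2(snapshot, incoming, load_date):
    """Return a new snapshot after applying incoming records as SCD Type 2."""
    incoming_by_key = {rec["id"]: rec for rec in incoming}
    matched = set()  # keys that already have a current row in the snapshot
    result = []
    for row in snapshot:
        new = incoming_by_key.get(row["id"])
        if row["is_current"] and new is not None:
            matched.add(row["id"])
            if new["city"] != row["city"]:
                # Update: close the old version and add a new current one.
                result.append({**row, "end_date": load_date, "is_current": False})
                result.append(new_version(new, load_date))
                continue
        # Unchanged, historical, or absent from the input: keep as-is.
        result.append(row)
    for key, rec in incoming_by_key.items():
        if key not in matched:
            # Insert: the key has no current row in the snapshot.
            result.append(new_version(rec, load_date))
    return result
```

In Spark the same idea would more likely be expressed as a join between the snapshot and the input on the business key, but the update/insert rules stay the same.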

Stage 2: Save the output of Stage 1 with a schema as Avro.
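
As an illustration of what "with a schema" could mean here, the sketch below builds an Avro record schema as JSON using only the Python standard library. The record name, namespace, and fields are hypothetical and mirror an assumed SCD 2 layout; the actual schema used by this project may differ.

```python
import json


def scd2_avro_schema():
    """Return an Avro record schema (as a JSON string) for an assumed SCD 2 row."""
    schema = {
        "type": "record",
        "name": "CustomerDim",        # hypothetical record name
        "namespace": "spark_poc",     # hypothetical namespace
        "fields": [
            {"name": "id", "type": "int"},
            {"name": "city", "type": "string"},
            # Avro represents dates as an int with the "date" logical type.
            {"name": "start_date", "type": {"type": "int", "logicalType": "date"}},
            {"name": "end_date", "type": {"type": "int", "logicalType": "date"}},
            {"name": "is_current", "type": "boolean"},
        ],
    }
    return json.dumps(schema, indent=2)
```

A schema string like this is the kind of value that Spark's Avro data source is expected to accept when writing, assuming the spark-avro module is available on the classpath.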

To do: Stage 3: Save the output of Stage 2 as Parquet.