bsathyamur/spark-dataPipeline

Creating a data pipeline using Spark

Jupyter Notebook

Building a Spark data pipeline

The goal of the project is to take the data in the walmar_sales.csv file, upload it to an Azure Blob Storage container, apply the processing steps listed below, split the result into files by country (UK and all others), and upload those files to the output blob container.

Processing steps performed:

Replace null customer IDs with Guest
Replace null descriptions with Unlisted
Add a quarter column derived from the purchase date
Add an invoice type column based on the purchase amount: zero for a return, otherwise a purchase
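
The row-level rules above, together with the country split, can be sketched in plain Python, independent of Spark. This is a minimal illustration, not the notebook's actual code. The column names (`customer_id`, `description`, `purchase_date`, `amount`, `country`), the ISO date format, and the `"United Kingdom"` label are all assumptions, since the file's schema is not shown here. It also reads the invoice type rule literally: an amount of exactly zero marks a return.

```python
from datetime import date


def process_row(row):
    """Apply the four processing steps to one sales record (a dict).

    All column names are hypothetical; adjust them to the real CSV header.
    """
    out = dict(row)  # copy so the input record is left untouched

    # Steps 1 and 2: replace missing values (None or empty string) with placeholders.
    if out.get("customer_id") in (None, ""):
        out["customer_id"] = "Guest"
    if out.get("description") in (None, ""):
        out["description"] = "Unlisted"

    # Step 3: calendar quarter (1-4) from the purchase date (assumed YYYY-MM-DD).
    purchase_date = date.fromisoformat(out["purchase_date"])
    out["quarter"] = (purchase_date.month - 1) // 3 + 1

    # Step 4: assumed reading of the rule: amount of zero is a return.
    out["invoice_type"] = "Return" if float(out["amount"]) == 0 else "Purchase"
    return out


def split_by_country(rows, uk_label="United Kingdom"):
    """Split processed rows into (uk_rows, other_rows); uk_label is an assumption."""
    uk_rows = [r for r in rows if r.get("country") == uk_label]
    other_rows = [r for r in rows if r.get("country") != uk_label]
    return uk_rows, other_rows


# Two made-up records, used only to exercise the functions.
sample = [
    {"customer_id": None, "description": "", "purchase_date": "2011-05-17",
     "amount": "0", "country": "France"},
    {"customer_id": "17850", "description": "MUG", "purchase_date": "2010-12-01",
     "amount": "12.5", "country": "United Kingdom"},
]
processed = [process_row(r) for r in sample]
uk_rows, other_rows = split_by_country(processed)
```

In a Spark notebook the same rules would more likely be written as DataFrame column expressions so they run in parallel, but the logic per record stays the same.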

Output files written to the output blob container