pyspark_pipeline

Installation

Clone the repo

git clone https://github.com/nagarajuerigi/pyspark_pipeline.git

Pre-processing steps

  1. Upload the metamodel lookup file and the data file to the data folder.
  2. Upload the notebooks init_setup.ipynb and process_csv.ipynb to your workspace in Databricks Community Edition to get started.
  3. Create a Spark cluster in Databricks with the latest runtime.
  4. Run init_setup.ipynb; it creates the Landing and Lookup directories.
  5. The notebook then copies the files from the Data directory into the Lookup and Landing directories (see the sketch after this list).
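To make the setup concrete, here is a minimal sketch of what init_setup.ipynb could do with Databricks dbutils. The directory paths and the file-naming convention are assumptions for illustration only, not the repo's actual values.

# Minimal sketch of an init_setup step, assuming hypothetical DBFS paths.
# dbutils is provided automatically inside a Databricks notebook.

landing_dir = "dbfs:/FileStore/landing"  # hypothetical Landing directory
lookup_dir = "dbfs:/FileStore/lookup"    # hypothetical Lookup directory
data_dir = "dbfs:/FileStore/data"        # hypothetical upload location

# Create the Landing and Lookup directories (no-op if they already exist).
dbutils.fs.mkdirs(landing_dir)
dbutils.fs.mkdirs(lookup_dir)

# Copy each uploaded file into the matching directory.
for f in dbutils.fs.ls(data_dir):
    if "lookup" in f.name.lower():  # hypothetical naming convention
        dbutils.fs.cp(f.path, lookup_dir + "/" + f.name)
    else:
        dbutils.fs.cp(f.path, landing_dir + "/" + f.name)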

Adding more components and functionality

Run process_csv.ipynb to test the end-to-end flow, then add new functionality and commit your changes to a feature branch.
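For orientation, below is a minimal sketch of the kind of flow process_csv.ipynb could test: reading the lookup metamodel and the landing CSV into DataFrames. The file paths and the assumption that both files have headers are illustrative, not the repo's actual values.

# Minimal sketch of a CSV-processing flow, assuming hypothetical paths.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available in Databricks

landing_path = "dbfs:/FileStore/landing/data.csv"     # hypothetical data file
lookup_path = "dbfs:/FileStore/lookup/metamodel.csv"  # hypothetical lookup file

# Read the metamodel lookup, assumed here to describe the expected columns.
lookup_df = spark.read.option("header", True).csv(lookup_path)
lookup_df.show(5)

# Read the landing data with headers and an inferred schema.
data_df = spark.read.option("header", True).option("inferSchema", True).csv(landing_path)
data_df.printSchema()
data_df.show(5)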

Package/library creation is underway in feature branches with our hackathon team.