pyspark_pipeline

Installation

Clone the repo

git clone https://github.com/nagarajuerigi/pyspark_pipeline.git

Pre-processing steps

  1. Upload the metamodel lookup file and the data file to the data folder.
  2. Upload the notebooks init_setup.ipynb and process_csv.ipynb to your workspace in Databricks Community Edition to get started.
  3. Create a Spark cluster in Databricks with the latest runtime.
  4. Run init_setup.ipynb; it creates the Landing and Lookup directories.
  5. The notebook then copies the files from the Data directory into the Lookup and Landing directories (see the sketch after this list).
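To make the setup concrete, here is a minimal sketch of what init_setup.ipynb could do with Databricks dbutils. The directory paths and the file-naming convention are assumptions for illustration only, not the repo's actual values.

# Minimal sketch of an init_setup step, assuming hypothetical DBFS paths.
# dbutils is provided automatically inside a Databricks notebook.

landing_dir = "dbfs:/FileStore/landing"  # hypothetical Landing directory
lookup_dir = "dbfs:/FileStore/lookup"    # hypothetical Lookup directory
data_dir = "dbfs:/FileStore/data"        # hypothetical upload location

# Create the Landing and Lookup directories (no-op if they already exist).
dbutils.fs.mkdirs(landing_dir)
dbutils.fs.mkdirs(lookup_dir)

# Copy each uploaded file into the matching directory.
for f in dbutils.fs.ls(data_dir):
    if "lookup" in f.name.lower():  # hypothetical naming convention
        dbutils.fs.cp(f.path, lookup_dir + "/" + f.name)
    else:
        dbutils.fs.cp(f.path, landing_dir + "/" + f.name)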

Adding more components and functionality

Run process_csv.ipynb to test the end-to-end flow, then add new functionality and commit your changes to a feature branch.
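For orientation, below is a minimal sketch of the kind of flow process_csv.ipynb could test: reading the lookup metamodel and the landing CSV into DataFrames. The file paths and the assumption that both files have headers are illustrative, not the repo's actual values.

# Minimal sketch of a CSV-processing flow, assuming hypothetical paths.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available in Databricks

landing_path = "dbfs:/FileStore/landing/data.csv"     # hypothetical data file
lookup_path = "dbfs:/FileStore/lookup/metamodel.csv"  # hypothetical lookup file

# Read the metamodel lookup, assumed here to describe the expected columns.
lookup_df = spark.read.option("header", True).csv(lookup_path)
lookup_df.show(5)

# Read the landing data with headers and an inferred schema.
data_df = spark.read.option("header", True).option("inferSchema", True).csv(landing_path)
data_df.printSchema()
data_df.show(5)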

Package/library creation is underway in feature branches with our hackathon team.