Spark Configuration on Windows 10

  1. Download all the required files from the URL below:
https://drive.google.com/drive/folders/1rBauyUVCRTbnKXgkMGh4l9MdIOVj8CQc?usp=sharing
  2. Install Java using the .exe file

Note: set the Java installation path to the "C:" drive

  3. Extract the Spark archive to the C: drive

  4. Extract the Kafka archive to the C: drive

  5. Add the following environment variables:

ENVIRONMENT VARIABLE NAME    VALUE
HADOOP_HOME                  C:\winutils
JAVA_HOME                    C:\Java\jdk1.8.0_202
SPARK_HOME                   C:\spark-3.0.3-bin-hadoop2.7
  6. Select the Path variable under Environment Variables and add the values below:
%SPARK_HOME%\bin
%HADOOP_HOME%\bin
%JAVA_HOME%\bin
C:\Java\jre1.8.0_281\bin
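
After saving the variables, open a new command prompt and verify the setup; these are standard version checks, not project scripts:

java -version
spark-submit --version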

Create conda environment

  1. Open a conda terminal and execute the command below:
conda create -n <env_name> python=3.8 -y
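Then activate the new environment so that subsequent installs go into it:
conda activate <env_name>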
  2. Select the <env_name> environment created in the previous step as the project interpreter in PyCharm.

  3. Install all the Python libraries specified in the requirements.txt file using the command below:

pip install -r requirements.txt
  4. To upload your code to a GitHub repo:
git init
git add .
git commit -m "first commit"
git branch -M main
git remote add origin <github_repo_link>
git push -u origin main

Train a random forest model on the insurance dataset

python training\stage_00_data_loader.py
python training\stage_01_data_validator.py
python training\stage_02_data_transformer.py
python training\stage_03_data_exporter.py
spark-submit training\stage_04_model_trainer.py
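
For reference, here is a minimal sketch of what the model-trainer stage might reduce to with Spark ML. The input path "insurance_transformed.parquet", the "features"/"expenses" column names, and the save location are illustrative assumptions, not the project's actual values:

from pyspark.sql import SparkSession
from pyspark.ml.regression import RandomForestRegressor

spark = SparkSession.builder.appName("insurance-rf-training").getOrCreate()

# Assumed output of the earlier transformer/exporter stages
df = spark.read.parquet("insurance_transformed.parquet")

# Fit a random forest on the assembled feature vector
rf = RandomForestRegressor(featuresCol="features", labelCol="expenses")
model = rf.fit(df)

# Assumed location for persisting the trained model
model.write().overwrite().save("saved_models/insurance_rf")
spark.stop()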

Prediction on the insurance dataset using the random forest model

python prediction\stage_00_data_loader.py
python prediction\stage_01_data_validator.py
python prediction\stage_02_data_transformer.py
python prediction\stage_03_data_exporter.py
spark-submit prediction\stage_04_model_predictor.py
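
The predictor stage would mirror the trainer: load the persisted model and call transform on the new data. A hedged sketch, with the same assumed paths as above:

from pyspark.sql import SparkSession
from pyspark.ml.regression import RandomForestRegressionModel

spark = SparkSession.builder.appName("insurance-rf-prediction").getOrCreate()

df = spark.read.parquet("insurance_to_predict.parquet")  # assumed input path

model = RandomForestRegressionModel.load("saved_models/insurance_rf")
predictions = model.transform(df)  # appends a "prediction" column
predictions.show()
spark.stop()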

Start the ZooKeeper and Kafka servers
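
On Windows, Kafka ships .bat launchers under bin\windows. Assuming Kafka was extracted to C:\kafka (adjust to your actual folder name), run each command in its own terminal, ZooKeeper first:

C:\kafka\bin\windows\zookeeper-server-start.bat C:\kafka\config\zookeeper.properties
C:\kafka\bin\windows\kafka-server-start.bat C:\kafka\config\server.properties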

Start the Kafka producer using the command below:

spark-submit csv_to_kafka.py
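
A minimal sketch of what a CSV-to-Kafka producer such as csv_to_kafka.py might contain; the input file "insurance.csv" and the topic "insurance" are placeholders, and writing to Kafka from Spark also needs the spark-sql-kafka connector on the classpath:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct

spark = SparkSession.builder.appName("csv-to-kafka").getOrCreate()

df = spark.read.csv("insurance.csv", header=True, inferSchema=True)

# Kafka expects a string/binary "value" column, so serialize each row as JSON
(df.select(to_json(struct(*df.columns)).alias("value"))
   .write
   .format("kafka")
   .option("kafka.bootstrap.servers", "localhost:9092")
   .option("topic", "insurance")
   .save())

spark.stop()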

Start the PySpark consumer using the command below:

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 spark_consumer_from_kafka.py
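
A matching sketch of the streaming consumer side, as spark_consumer_from_kafka.py might implement it; the topic name and the console sink are illustrative assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-consumer").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "insurance")
          .option("startingOffsets", "earliest")
          .load())

# Kafka delivers key/value as binary, so cast to string before use
query = (stream.selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .format("console")
         .outputMode("append")
         .start())

query.awaitTermination()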