COVID-19 Twitter Data Sentiment Analysis using Spark and Kafka

  • All commands are issued in the fyp-kafka namespace.

Deploying Kafka Cluster

  • Follow the READMEs inside the kafka-deployment directory to deploy Kafka and ZooKeeper.
  • ZooKeeper must be deployed before Kafka. (A small connectivity check is sketched after this list.)
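
Once both are running, a quick way to confirm the broker is reachable is a short kafka-python script run from a pod inside the namespace. This is only a sketch: the service name kafka-service:9092 is an assumption and should match whatever the kafka-deployment manifests actually expose.

from kafka import KafkaConsumer

# Assumed broker address; replace with the service name exposed by kafka-deployment.
consumer = KafkaConsumer(bootstrap_servers="kafka-service:9092")
print(consumer.topics())  # lists existing topics once Kafka and ZooKeeper are up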

Twitter Producer

  • Create a Docker image of the Twitter producer using the following commands (a minimal sketch of the producer script itself appears at the end of this section):
cd twitter-producer
sudo docker build -t twitter-producer .
  • Once the image is built, push it to the local registry and deploy a pod and a service for the twitter-producer with the following commands:
sudo docker create $imageId
sudo docker commit $hash 10.3.2.48:5000/twitter-producer
sudo docker push 10.3.2.48:5000/twitter-producer
kubectl apply -f config.yaml
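
For reference, a producer along these lines could stream tweets into Kafka. This is only a sketch, not the exact contents of twitter-producer: the topic name covid-tweets, the broker address, the credential environment variables, and the use of Tweepy 4.x are all assumptions.

import json
import os

import tweepy
from kafka import KafkaProducer

# Assumed broker service and topic; the real producer may use different names.
producer = KafkaProducer(
    bootstrap_servers="kafka-service:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

class CovidStream(tweepy.Stream):
    def on_status(self, status):
        # Forward each matching tweet's text to the topic consumed by Spark.
        producer.send("covid-tweets", {"text": status.text})

stream = CovidStream(
    os.environ["CONSUMER_KEY"],
    os.environ["CONSUMER_SECRET"],
    os.environ["ACCESS_TOKEN"],
    os.environ["ACCESS_TOKEN_SECRET"],
)
stream.filter(track=["covid", "coronavirus"])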

Dash Frontend

  • SSH into the master node, forwarding local port 8050 (for example, ssh -L 8050:localhost:8050 <user>@<master-node>).

  • Create a Docker image of the Dash frontend using the following commands (a minimal sketch of the Dash app appears at the end of this section):

cd fyp-dash
sudo docker build -t fyp-dash .
  • Once the image is built, push it to the local registry and deploy a pod and a service for the Dash frontend with the following commands:
sudo docker create $imageId
sudo docker commit $hash 10.3.2.48:5000/fyp-dash
sudo docker push 10.3.2.48:5000/fyp-dash
kubectl apply -f config.yaml
  • Run the following command to access the Dash UI at localhost:8050:
kubectl port-forward svc/fyp-dash-service 8050:8050
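
For context, the app served on port 8050 could look roughly like the sketch below. The data shown is a hypothetical placeholder, since the real frontend reads the sentiment results produced by the Spark job.

import dash
from dash import dcc, html
import pandas as pd
import plotly.express as px

app = dash.Dash(__name__)

# Hypothetical placeholder data; the real app reads results produced by Spark.
df = pd.DataFrame({"sentiment": ["positive", "neutral", "negative"],
                   "tweets": [120, 80, 45]})

app.layout = html.Div([
    html.H1("COVID-19 Tweet Sentiment"),
    dcc.Graph(figure=px.bar(df, x="sentiment", y="tweets")),
])

if __name__ == "__main__":
    # Bind to all interfaces so the Kubernetes service can reach the container.
    app.run_server(host="0.0.0.0", port=8050)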

Spark Cluster Deployment

  • First, start a Python HTTP server on your local machine in this directory:
python3 -m http.server 8000
  • Use ngrok to make this server publicly accessible:
ngrok http 8000
  • The Spark job file will then be available at https://ngrok_address/SparkStreaming/main.py.

  • Download and extract the Spark distribution on the master node using the following commands:

wget https://archive.apache.org/dist/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
tar -xvf spark-2.4.7-bin-hadoop2.7.tgz
  • Create a new service account for Spark and grant it RBAC permissions to manage resources in the fyp-kafka namespace:
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=fyp-kafka:spark --namespace=fyp-kafka
  • Execute the following command inside the extracted Spark directory on the master node, substituting your own ngrok address (a sketch of the streaming job itself appears after this list):
bin/spark-submit \
      --deploy-mode cluster \
      --master k8s://https://10.4.2.37:6443 \
      --conf spark.executor.memory=500m \
      --conf spark.driver.memory=4G \
      --conf spark.driver.cores=4 \
      --conf spark.executor.cores=2 \
      --conf spark.executor.instances=5 \
      --conf spark.app.name=spark-full \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      --conf spark.kubernetes.container.image=gcr.io/spark-operator/spark-py:v2.4.5 \
      --conf spark.kubernetes.namespace=fyp-kafka \
      --conf spark.kubernetes.pyspark.pythonVersion=3 \
      --jars https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-8-assembly_2.11/2.4.5/spark-streaming-kafka-0-8-assembly_2.11-2.4.5.jar \
      https://ngrok_address/SparkStreaming/main.py
  • Run the following command on the master node to find the pod name of the Spark driver:
kubectl get pods
  • SSH into the master node, forwarding local port 4040 (for example, ssh -L 4040:localhost:4040 <user>@<master-node>), then run the following command to access the Spark UI at localhost:4040:
kubectl port-forward $spark_driver_pod_name 4040:4040
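
As referenced above, SparkStreaming/main.py consumes tweets from Kafka and scores them. The outline below is only a sketch of that flow using the Kafka 0-8 DStream API implied by the --jars argument; the topic name, broker address, and scoring function are placeholders, not the real job.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="spark-full")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Assumed topic and broker; these must match the twitter-producer configuration.
stream = KafkaUtils.createDirectStream(
    ssc, ["covid-tweets"], {"metadata.broker.list": "kafka-service:9092"}
)

def score(text):
    # Placeholder: the real job applies the trained sentiment model here.
    return "positive" if "good" in text.lower() else "negative"

# Count tweets per sentiment label in each micro-batch and print the result.
stream.map(lambda kv: kv[1]) \
      .map(lambda tweet: (score(tweet), 1)) \
      .reduceByKey(lambda a, b: a + b) \
      .pprint()

ssc.start()
ssc.awaitTermination()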

Machine Learning

Follow the model-training/training.ipynb Jupyter notebook to train and save the machine-learning model; a rough outline of the training flow is sketched below.
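
The exact pipeline lives in the notebook. Purely as an illustration, a typical flow trains a text classifier on labelled tweets and persists it for later use by the Spark job, for example with scikit-learn; the file names and column names here are assumptions, not the notebook's actual code.

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical labelled dataset with "text" and "sentiment" columns.
data = pd.read_csv("labelled_tweets.csv")

model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(data["text"], data["sentiment"])

# Save the trained model so the streaming job can load and apply it.
joblib.dump(model, "sentiment_model.joblib")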