This project is a dockerised pub/sub application written in Python that publishes data using Kafka and consumes/aggregates it using the Spark processing framework.
- A dockerised Kafka application that produces user behaviour data and publishes it to a Kafka topic through its producer.
- A dockerised PySpark application that reads the data from that Kafka topic and performs aggregations on the raw events in real time (see the sketches after this list).
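The two services could look roughly like the sketches below. These are minimal, hedged examples rather than the repository's actual code: the broker address `localhost:9092`, the topic name `user-behaviour`, and the event fields `user_id`/`action`/`ts` are illustrative assumptions.

```python
# Hypothetical producer sketch using kafka-python (v1.4.7); the broker address,
# topic name and event schema are placeholders, not taken from this repository.
import json
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    event = {
        "user_id": random.randint(1, 100),
        "action": random.choice(["click", "view", "purchase"]),
        "ts": int(time.time() * 1000),
    }
    producer.send("user-behaviour", value=event)  # publish one event to the topic
    time.sleep(1)
```

On the consuming side, a PySpark (v2.4.4) Structured Streaming job could subscribe to the same topic and aggregate the raw events, assuming the job is submitted with the `spark-sql-kafka-0-10` package on the classpath:

```python
# Hypothetical consumer sketch: read the placeholder topic, parse the JSON
# events and count actions per 1-minute window. Schema and names mirror the
# producer sketch above, not the repository's actual code.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, from_json, window
from pyspark.sql.types import IntegerType, LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("user-behaviour-aggregator").getOrCreate()

schema = StructType([
    StructField("user_id", IntegerType()),
    StructField("action", StringType()),
    StructField("ts", LongType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "user-behaviour")
       .load())

# Kafka delivers the payload as bytes in the `value` column; parse it as JSON.
events = (raw
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Example aggregation: number of events per action over a 1-minute window.
agg = (events
       .withColumn("event_time", (col("ts") / 1000).cast("timestamp"))
       .groupBy(window(col("event_time"), "1 minute"), col("action"))
       .agg(count("*").alias("events")))

query = (agg.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```

Structured Streaming keeps the aggregation state across micro-batches, so the windowed counts keep updating as new events arrive on the topic.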
The architecture involves the following tech stack:
- Kafka (kafka-python v1.4.7)
- Spark v2.4.4
- Python v3.7.2
- Docker
Next steps:
- Push the data into a persistent store such as InfluxDB/MySQL/PostgreSQL (see the sketch after this list).
- Stream real-time aggregations from the persistent store into dashboards built in Grafana.
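As an illustration of the first point, here is a hedged sketch of how the Spark job could push each micro-batch of aggregates into PostgreSQL using `foreachBatch` and the JDBC writer. The connection URL, table name and credentials are placeholders, the PostgreSQL JDBC driver would need to be on the Spark classpath, and Grafana could then query the resulting table directly.

```python
# Hypothetical sink: write each micro-batch of the `agg` DataFrame from the
# consumer sketch above into PostgreSQL. URL, table and credentials are
# placeholders; requires the org.postgresql JDBC driver on the Spark classpath.
def write_to_postgres(batch_df, batch_id):
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://postgres:5432/analytics")
        .option("dbtable", "user_behaviour_aggregates")
        .option("user", "spark")
        .option("password", "spark")
        .mode("append")
        .save())

query = (agg.writeStream
    .outputMode("update")
    .foreachBatch(write_to_postgres)
    .start())
query.awaitTermination()
```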
### Clone the repository
$ git clone https://github.com/ashshetty90/spark-kafka-application.git
### After cloning the repository, run the command below in the root directory of the project. Make sure Docker is installed on your machine.
$ docker-compose up -d
### The above command will run the Kafka producer and the Spark consumer inside Docker containers.
$ docker container ps
### This will list all currently running Docker containers.
$ docker logs <container_id>
### The logs for each container can be viewed by running the above command.
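Once the containers are up, a quick hedged way to confirm that events are actually flowing is to attach a throwaway consumer to the topic from the host. This assumes kafka-python is installed locally, that the broker port 9092 is published by the compose setup, and it reuses the placeholder topic name from the sketches above.

```python
# Throwaway check: print a handful of messages from the topic and exit.
# Broker address and topic name are the same placeholders used above.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-behaviour",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # give up after 10 seconds of silence
)

for i, message in enumerate(consumer):
    print(message.value)
    if i >= 4:  # stop after five messages
        break
```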