This project is a dockerised pub/sub application written in Python that publishes data using Kafka and consumes/aggregates it using the Spark processing framework.
- A dockerised Kafka application that produces user behaviour data and publishes it to a Kafka topic through its producer.
- A dockerised PySpark application that reads the data from that Kafka topic and performs aggregations on the raw events in real time (see the sketches after this list).
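The two services could look roughly like the sketches below. These are minimal, hedged examples rather than the repository's actual code: the broker address `localhost:9092`, the topic name `user-behaviour`, and the event fields `user_id`/`action`/`ts` are illustrative assumptions.

```python
# Hypothetical producer sketch using kafka-python (v1.4.7); the broker address,
# topic name and event schema are placeholders, not taken from this repository.
import json
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    event = {
        "user_id": random.randint(1, 100),
        "action": random.choice(["click", "view", "purchase"]),
        "ts": int(time.time() * 1000),
    }
    producer.send("user-behaviour", value=event)  # publish one event to the topic
    time.sleep(1)
```

On the consuming side, a PySpark (v2.4.4) Structured Streaming job could subscribe to the same topic and aggregate the raw events, assuming the job is submitted with the `spark-sql-kafka-0-10` package on the classpath:

```python
# Hypothetical consumer sketch: read the placeholder topic, parse the JSON
# events and count actions per 1-minute window. Schema and names mirror the
# producer sketch above, not the repository's actual code.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, from_json, window
from pyspark.sql.types import IntegerType, LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("user-behaviour-aggregator").getOrCreate()

schema = StructType([
    StructField("user_id", IntegerType()),
    StructField("action", StringType()),
    StructField("ts", LongType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "user-behaviour")
       .load())

# Kafka delivers the payload as bytes in the `value` column; parse it as JSON.
events = (raw
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Example aggregation: number of events per action over a 1-minute window.
agg = (events
       .withColumn("event_time", (col("ts") / 1000).cast("timestamp"))
       .groupBy(window(col("event_time"), "1 minute"), col("action"))
       .agg(count("*").alias("events")))

query = (agg.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```

Structured Streaming keeps the aggregation state across micro-batches, so the windowed counts keep updating as new events arrive on the topic.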
The architecture involves the following tech stack:
- Kafka (kafka-python v1.4.7)
- Spark v2.4.4
- Python v3.7.2
- Docker
Next steps:
- Push the data into a persistent store such as InfluxDB/MySQL/PostgreSQL (see the sketch after this list).
- Stream real-time aggregations from the persistent store into dashboards built in Grafana.
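As an illustration of the first point, here is a hedged sketch of how the Spark job could push each micro-batch of aggregates into PostgreSQL using `foreachBatch` and the JDBC writer. The connection URL, table name and credentials are placeholders, the PostgreSQL JDBC driver would need to be on the Spark classpath, and Grafana could then query the resulting table directly.

```python
# Hypothetical sink: write each micro-batch of the `agg` DataFrame from the
# consumer sketch above into PostgreSQL. URL, table and credentials are
# placeholders; requires the org.postgresql JDBC driver on the Spark classpath.
def write_to_postgres(batch_df, batch_id):
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://postgres:5432/analytics")
        .option("dbtable", "user_behaviour_aggregates")
        .option("user", "spark")
        .option("password", "spark")
        .mode("append")
        .save())

query = (agg.writeStream
    .outputMode("update")
    .foreachBatch(write_to_postgres)
    .start())
query.awaitTermination()
```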
### Clone the repository
$ git clone https://github.com/ashshetty90/spark-kafka-application.git
### After cloning the repository, run the command below in the root directory of the project. Make sure Docker is installed on your machine.
$ docker-compose up -d
### The above command will run the Kafka producer and the Spark consumer inside Docker containers.
$ docker container ps
### This will list all currently running Docker containers.
$ docker logs <container_id>
### The logs for each container can be viewed by running the above command.
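Once the containers are up, a quick hedged way to confirm that events are actually flowing is to attach a throwaway consumer to the topic from the host. This assumes kafka-python is installed locally, that the broker port 9092 is published by the compose setup, and it reuses the placeholder topic name from the sketches above.

```python
# Throwaway check: print a handful of messages from the topic and exit.
# Broker address and topic name are the same placeholders used above.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-behaviour",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # give up after 10 seconds of silence
)

for i, message in enumerate(consumer):
    print(message.value)
    if i >= 4:  # stop after five messages
        break
```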