Setting up the project
- Clone (or pull) the project to your local machine
- Create and activate a virtual environment (venv)
- Run
pip install -r requirements.txt
Setting up Docker
- Download Docker Desktop from https://www.docker.com/products/docker-desktop/
- Complete the Docker setup
- Start the Docker Desktop application on your local machine
- In the project's docker-compose.yml file, replace the IP address with your local machine's IP address
- To get your local machine's IP, run the following command in CMD:
ipconfig
- In CMD, run
docker-compose up -d
You can see the containers running in the Docker Desktop application.
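As an alternative to reading the address off the ipconfig output, a short Python snippet can report the machine's primary IPv4 address. This is a common socket trick, not part of the project: 8.8.8.8 is only used for a routing lookup, and no packets are actually sent.

```python
import socket

def get_local_ip() -> str:
    """Best-effort lookup of this machine's primary IPv4 address."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Connecting a UDP socket sends no packets; it only asks the OS
        # which local interface would be used to route to this address.
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    except OSError:
        # Offline / no route: fall back to a hostname lookup.
        try:
            return socket.gethostbyname(socket.gethostname())
        except OSError:
            return "127.0.0.1"
    finally:
        s.close()

if __name__ == "__main__":
    print(get_local_ip())
```

Paste the printed address into docker-compose.yml wherever the IP placeholder appears.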
Running the producer
- Run the app/main/producer/mock_data_producer_app.py file from the project
- This will read the data from app/data/input/mock_data.json and produce messages to a Kafka topic (the topic name is configured in app/main/config.py)
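The producer's core logic can be sketched as follows. This is a simplified illustration, not the project's actual code: it assumes the kafka-python client, and the topic name "data-stream-input" is a placeholder (the real name lives in app/main/config.py).

```python
import json

def serialize(record: dict) -> bytes:
    """Encode one record as UTF-8 JSON bytes for Kafka."""
    return json.dumps(record).encode("utf-8")

def main() -> None:
    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(bootstrap_servers="<your-ip>:29093")
    with open("app/data/input/mock_data.json") as f:
        records = json.load(f)
    for record in records:
        # "data-stream-input" is a placeholder topic name; the real
        # one is configured in app/main/config.py.
        producer.send("data-stream-input", serialize(record))
    producer.flush()
    producer.close()

# main()  # uncomment to run against a live broker
```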
Running the consumer
- Run the app/main/consumer/mock_data_consumer_app.py file from the project
- This will read the data from the Kafka topic into a PySpark DataFrame and save the results to multiple CSV files (app/data/output/mock_data)
- One of the methods writes its result to another Kafka topic
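The consumer's read path can be sketched like this, assuming PySpark with the spark-sql-kafka integration package on the classpath. The topic name and checkpoint path below are placeholders, not the project's actual values (those live in app/main/config.py).

```python
import json

def decode_value(value: bytes) -> dict:
    """Decode one Kafka message value (UTF-8 JSON) into a dict."""
    return json.loads(value.decode("utf-8"))

def main() -> None:
    # Requires pyspark plus the spark-sql-kafka package on the classpath.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mock-data-consumer").getOrCreate()
    df = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "<your-ip>:29093")
        .option("subscribe", "data-stream-input")  # placeholder topic name
        .option("startingOffsets", "earliest")
        .load()
    )
    # Kafka delivers values as binary; cast to string before writing out.
    query = (
        df.selectExpr("CAST(value AS STRING) AS value")
        .writeStream
        .format("csv")
        .option("path", "app/data/output/mock_data")
        .option("checkpointLocation", "app/data/output/_checkpoints")  # assumed path
        .start()
    )
    query.awaitTermination()

# main()  # uncomment to run against a live broker
```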
Setting up Offset Explorer
- This is a free tool that gives an overview of your Kafka cluster
- Download it from https://www.kafkatool.com/download.html
- Create a new connection and set the fields as follows -
- Cluster name - can be anything
- Zookeeper Host - localhost
- Zookeeper Port - 22181
- Under Advanced -> Bootstrap servers - <your-ip>:29093
Whenever a topic is created while running the project, it will appear here, along with the number of messages it contains.
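Before configuring the connection, you can sanity-check that the Zookeeper (22181) and Kafka (29093) ports from docker-compose.yml are actually reachable, for example with a small Python snippet:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Ports match this guide's docker-compose setup; adjust if yours differ.
    print("Zookeeper (22181):", port_open("localhost", 22181))
    print("Kafka (29093):", port_open("localhost", 29093))
```

If either check prints False, confirm the containers are up (docker ps) before troubleshooting Offset Explorer itself.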
You can also check the messages in a Kafka topic using the following approach -
- Open PowerShell or CMD
- Run
docker ps
- Run
docker exec -it kafka-kafka-1 bash
where kafka-kafka-1 is the name of the Kafka container
- Run
cd /bin
- To consume from a Kafka topic -
kafka-console-consumer --bootstrap-server <your-ip>:29093 --topic data-stream-output --from-beginning
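If you prefer to consume from Python instead of the container shell, a rough equivalent of the command above (assuming the kafka-python client) is:

```python
def main() -> None:
    from kafka import KafkaConsumer  # pip install kafka-python

    consumer = KafkaConsumer(
        "data-stream-output",
        bootstrap_servers="<your-ip>:29093",
        auto_offset_reset="earliest",  # same effect as --from-beginning
    )
    for message in consumer:
        # Message values are raw bytes; decode before printing.
        print(message.value.decode("utf-8"))

# main()  # uncomment to run against a live broker
```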