Microservice-based Data Pipeline with Kafka, Spark, and Metabase
This repository contains a microservice-based data pipeline that uses Apache Kafka for ingesting streaming data, Apache Spark for processing it, and Metabase for interactive dashboards. The processed data is written to a PostgreSQL database, and all components run as pods on a local Kubernetes cluster.
Prerequisites:
- minikube and kubectl
- Docker
-
Clone the repository:
git clone https://github.com/phaneesh707/microservice-based-datapipeline
cd microservice-based-datapipeline
-
Run every script file in each folder:
./script_file_name.sh
-
Create a Kafka topic and update its name in the producer and consumer files:
kubectl exec POD_NAME -- kafka-topics.sh --create --topic TOPIC-NAME \
  --bootstrap-server kafka-svc:9092 --partitions 1 --replication-factor 1
- to list all the topics created
kubectl exec POD_NAME -- kafka-topics.sh --list --bootstrap-server kafka-svc:9092
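If the topic steps above need to be repeated often, the same commands can be assembled from Python. This is only a minimal sketch: the function names are hypothetical and not part of this repository, and the pod and topic names you pass in are placeholders, just as in the commands above.

```python
import subprocess

# Service address taken from the commands above.
BOOTSTRAP_SERVER = "kafka-svc:9092"


def create_topic_cmd(pod_name, topic, partitions=1, replication_factor=1):
    """Build the argument list that creates a topic inside the Kafka pod."""
    return [
        "kubectl", "exec", pod_name, "--",
        "kafka-topics.sh", "--create",
        "--topic", topic,
        "--bootstrap-server", BOOTSTRAP_SERVER,
        "--partitions", str(partitions),
        "--replication-factor", str(replication_factor),
    ]


def list_topics_cmd(pod_name):
    """Build the argument list that lists all topics."""
    return [
        "kubectl", "exec", pod_name, "--",
        "kafka-topics.sh", "--list",
        "--bootstrap-server", BOOTSTRAP_SERVER,
    ]


def run(cmd):
    """Run a command and return its stdout; raises CalledProcessError on failure."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, `run(create_topic_cmd("POD_NAME", "TOPIC-NAME"))` would issue the same create command shown above, with the placeholders replaced by your own values.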
-
Enter the Postgres pod, create a database and a table, and update the table name in consumer.py:
psql -U USER user
CREATE DATABASE DB-NAME;
# create table
-
Enter the producer pod and run the producer file:
python producer.py
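For orientation, a producer of this kind usually builds records, serializes them, and sends them to the topic. The sketch below is not the repository's producer.py: the event fields, function names, and the use of the kafka-python package are all assumptions made for illustration.

```python
import json
import time


def make_event(sensor_id, value, timestamp=None):
    """Build one example record; the field names are illustrative only."""
    if timestamp is None:
        timestamp = time.time()
    return {"sensor_id": sensor_id, "value": value, "timestamp": timestamp}


def serialize(event):
    """Encode an event as UTF-8 JSON bytes, the form sent to Kafka."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


def send_events(producer, topic, events):
    """Send each event to the topic, flush, and return how many were sent."""
    count = 0
    for event in events:
        producer.send(topic, serialize(event))
        count += 1
    producer.flush()
    return count


def main(topic="TOPIC-NAME"):
    # Imported here so the helpers above work without the client installed.
    # Assumption: the kafka-python package is available in the producer pod.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="kafka-svc:9092")
    events = [make_event(i, i * 1.5) for i in range(10)]
    print("sent", send_events(producer, topic, events))
```

Because `send_events` only needs an object with `send` and `flush` methods, it can be tried out with a stand-in producer before connecting to the real broker.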
-
Enter the consumer pod and run the following command:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 consumer.py
- you should now see the data being processed and written to the database
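To illustrate what a Spark consumer submitted this way might do, here is a minimal Structured Streaming sketch that reads the topic and appends each micro-batch to Postgres over JDBC. It is not the repository's consumer.py. The function names, the single text column `value`, and the connection details are assumptions, and a PostgreSQL JDBC driver would have to be on Spark's classpath in addition to the Kafka package named in the command above.

```python
# Service address taken from the Kafka commands in this README.
KAFKA_BOOTSTRAP = "kafka-svc:9092"


def jdbc_options(host, database, table, user, password, port=5432):
    """Build the options Spark's JDBC writer expects for PostgreSQL."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def run(topic, options):
    """Stream records from Kafka and append them to the configured table."""
    # Imported here so jdbc_options can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("consumer-sketch").getOrCreate()

    stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP)
        .option("subscribe", topic)
        .load()
    )

    # Kafka delivers binary key/value columns; keep the value as text.
    rows = stream.selectExpr("CAST(value AS STRING) AS value")

    def write_batch(batch_df, batch_id):
        # Assumption: the target table has one text column named "value".
        batch_df.write.format("jdbc").options(**options).mode("append").save()

    query = rows.writeStream.foreachBatch(write_batch).start()
    query.awaitTermination()
```

Writing through `foreachBatch` is used here because Spark has no built-in streaming JDBC sink; each micro-batch is written with the ordinary batch writer instead.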
-
Run
minikube tunnel
- this makes the dashboard reachable from your local machine; keep the command running in a separate terminal
-
Get the IP of the Metabase service:
kubectl get svc
- copy the IP of 'metabase-service' and paste it into the browser
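The lookup above can also be scripted by asking kubectl for JSON and reading the address from it. The sketch below assumes that 'metabase-service' is a LoadBalancer service (which is what `minikube tunnel` serves) and that its first listed port is the dashboard port. The function name is hypothetical.

```python
import json


def service_url(service_json):
    """Return an http URL for a LoadBalancer service, or None if no address is assigned yet."""
    svc = json.loads(service_json)
    ingress = svc.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    if not ingress:
        return None  # e.g. the tunnel is not running yet
    host = ingress[0].get("ip") or ingress[0].get("hostname")
    port = svc["spec"]["ports"][0]["port"]  # assumption: first port is the dashboard
    return f"http://{host}:{port}"
```

A possible use is to save the output of `kubectl get svc metabase-service -o json` to a file, read it, and pass the text to `service_url`.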
-
Ta-da! Now you can access your dashboard and get analytics on your data.
-
To open a shell in a pod:
kubectl exec -it POD-NAME -- /bin/bash
-
In case of errors, check the logs of a pod:
kubectl logs POD-NAME
-
To describe a pod:
kubectl describe pod POD-NAME
-
To get all the pods, services, and deployments:
kubectl get pods,services,deployments
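Most commands in this README need a concrete pod name in place of POD-NAME. As a rough sketch, the names can be picked out of the JSON that `kubectl get pods -o json` prints. The function name and the prefix-matching rule are assumptions for illustration.

```python
import json


def pods_matching(pods_json, prefix):
    """Return (name, phase) pairs for pods whose name starts with the prefix."""
    items = json.loads(pods_json).get("items", [])
    return [
        (pod["metadata"]["name"], pod.get("status", {}).get("phase"))
        for pod in items
        if pod["metadata"]["name"].startswith(prefix)
    ]
```

Given the JSON text in a variable `out`, `pods_matching(out, "kafka")` would list the pods whose names begin with `kafka`, together with their current phase.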