Welcome to our data pipeline project! This pipeline ingests data from an API endpoint and moves it through Apache Airflow, Kafka, Apache Spark, and Cassandra, all managed within Docker containers.
- Data Collection: We start by fetching data from the [randomuser.me](https://randomuser.me) API endpoint to generate random user data for our pipeline (a sketch of this step follows the list below).
- Workflow Management: Apache Airflow orchestrates the data flow and stores the fetched data in a PostgreSQL database.
- Streaming via Kafka & Zookeeper: Data seamlessly streams from PostgreSQL to the processing engine using Kafka, enabling efficient real-time processing.
- Control Center and Schema Registry: These services monitor our Kafka streams and manage their message schemas.
- Processing with Spark: Apache Spark processes the streamed data and derives insights from it (see the consumer-side sketch after this list).
- Storage in Cassandra: Finally, Cassandra provides a reliable home for our valuable processed data.
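
To make the ingestion side concrete, here is a minimal sketch of the kind of DAG this pipeline uses: a single `PythonOperator` task fetches one user from randomuser.me and publishes it to a Kafka topic. The DAG id, topic name (`users_created`), broker address (`broker:29092`), and selected fields are illustrative assumptions, not taken from the repository.

```python
import json
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from kafka import KafkaProducer


def stream_user_data():
    # Fetch one random user from the public randomuser.me API.
    response = requests.get("https://randomuser.me/api/", timeout=10)
    response.raise_for_status()
    user = response.json()["results"][0]

    # Keep only the fields downstream consumers need.
    record = {
        "first_name": user["name"]["first"],
        "last_name": user["name"]["last"],
        "email": user["email"],
        "country": user["location"]["country"],
    }

    # Publish the record to Kafka; "broker:29092" is an assumed address
    # for the broker inside the Docker network.
    producer = KafkaProducer(
        bootstrap_servers=["broker:29092"],
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("users_created", record)
    producer.flush()


with DAG(
    dag_id="user_automation",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="stream_data_from_api",
        python_callable=stream_user_data,
    )
```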
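On the consumer side, the Spark job can be sketched as a Structured Streaming query that reads the same Kafka topic, parses the JSON payload, and appends rows to Cassandra. The keyspace and table names (`spark_streams.created_users`), the checkpoint path, and the Cassandra host are assumptions; the `spark-sql-kafka` and `spark-cassandra-connector` packages must be on the Spark classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder.appName("UserStreamProcessor")
    .config("spark.cassandra.connection.host", "cassandra")
    .getOrCreate()
)

# Schema matching the JSON records produced by the Airflow task.
schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
    StructField("country", StringType()),
])

# Read the Kafka topic as a streaming DataFrame and parse the JSON payload.
users = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:29092")
    .option("subscribe", "users_created")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Continuously append parsed rows to a pre-created Cassandra table.
query = (
    users.writeStream.format("org.apache.spark.sql.cassandra")
    .option("keyspace", "spark_streams")
    .option("table", "created_users")
    .option("checkpointLocation", "/tmp/checkpoint")
    .start()
)
query.awaitTermination()
```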
The pipeline is built with:

- Apache Airflow
- Python
- Apache Kafka
- Apache Zookeeper
- Apache Spark
- Cassandra
- PostgreSQL
- Docker
To run the pipeline locally:

- Clone the repository:

  ```bash
  git clone https://github.com/ornab/Data-Streaming-pipeline-with-Airflow-Kafka-Spark-Cassandra.git
  ```

- Navigate to the project directory:

  ```bash
  cd Data-Streaming-pipeline-with-Airflow-Kafka-Spark-Cassandra
  ```

- Run Docker Compose to spin up the services:

  ```bash
  docker-compose up -d
  ```
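
Once the containers are running, `docker-compose ps` should list every service as up. Assuming the compose file maps the services' default ports, the Airflow web UI is typically reachable at `http://localhost:8080` and Control Center at `http://localhost:9021`, but check `docker-compose.yml` for the exact mappings.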