Welcome to our data pipeline project! This pipeline ingests data from an API endpoint and moves it through Apache Airflow, Kafka, Apache Spark, and Cassandra, all managed within Docker containers.
- Data Collection: We start by fetching data from the [randomuser.me](https://randomuser.me) API endpoint to generate random user data for our pipeline (a sketch of this step follows the list below).
- Workflow Management: Apache Airflow orchestrates the data flow and stores the fetched data in a PostgreSQL database.
- Streaming via Kafka & Zookeeper: Data seamlessly streams from PostgreSQL to the processing engine using Kafka, enabling efficient real-time processing.
- Control Center and Schema Registry: These services monitor our Kafka streams and manage their message schemas.
- Processing with Spark: Apache Spark processes the streamed data and derives insights from it (see the consumer-side sketch after this list).
- Storage in Cassandra: Finally, Cassandra provides a reliable home for our valuable processed data.
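
To make the ingestion side concrete, here is a minimal sketch of the kind of DAG this pipeline uses: a single `PythonOperator` task fetches one user from randomuser.me and publishes it to a Kafka topic. The DAG id, topic name (`users_created`), broker address (`broker:29092`), and selected fields are illustrative assumptions, not taken from the repository.

```python
import json
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from kafka import KafkaProducer


def stream_user_data():
    # Fetch one random user from the public randomuser.me API.
    response = requests.get("https://randomuser.me/api/", timeout=10)
    response.raise_for_status()
    user = response.json()["results"][0]

    # Keep only the fields downstream consumers need.
    record = {
        "first_name": user["name"]["first"],
        "last_name": user["name"]["last"],
        "email": user["email"],
        "country": user["location"]["country"],
    }

    # Publish the record to Kafka; "broker:29092" is an assumed address
    # for the broker inside the Docker network.
    producer = KafkaProducer(
        bootstrap_servers=["broker:29092"],
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("users_created", record)
    producer.flush()


with DAG(
    dag_id="user_automation",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="stream_data_from_api",
        python_callable=stream_user_data,
    )
```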
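On the consumer side, the Spark job can be sketched as a Structured Streaming query that reads the same Kafka topic, parses the JSON payload, and appends rows to Cassandra. The keyspace and table names (`spark_streams.created_users`), the checkpoint path, and the Cassandra host are assumptions; the `spark-sql-kafka` and `spark-cassandra-connector` packages must be on the Spark classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder.appName("UserStreamProcessor")
    .config("spark.cassandra.connection.host", "cassandra")
    .getOrCreate()
)

# Schema matching the JSON records produced by the Airflow task.
schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
    StructField("country", StringType()),
])

# Read the Kafka topic as a streaming DataFrame and parse the JSON payload.
users = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:29092")
    .option("subscribe", "users_created")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Continuously append parsed rows to a pre-created Cassandra table.
query = (
    users.writeStream.format("org.apache.spark.sql.cassandra")
    .option("keyspace", "spark_streams")
    .option("table", "created_users")
    .option("checkpointLocation", "/tmp/checkpoint")
    .start()
)
query.awaitTermination()
```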
The pipeline is built with:

- Apache Airflow
- Python
- Apache Kafka
- Apache Zookeeper
- Apache Spark
- Cassandra
- PostgreSQL
- Docker
To run the pipeline locally:

- Clone the repository:

  ```bash
  git clone https://github.com/ornab/Data-Streaming-pipeline-with-Airflow-Kafka-Spark-Cassandra.git
  ```

- Navigate to the project directory:

  ```bash
  cd Data-Streaming-pipeline-with-Airflow-Kafka-Spark-Cassandra
  ```

- Run Docker Compose to spin up the services:

  ```bash
  docker-compose up -d
  ```
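
Once the containers are running, `docker-compose ps` should list every service as up. Assuming the compose file maps the services' default ports, the Airflow web UI is typically reachable at `http://localhost:8080` and Control Center at `http://localhost:9021`, but check `docker-compose.yml` for the exact mappings.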