Real-Time Data Analysis Project
Overview
This project aims to analyze real-time data using various technologies such as Druid, Kafka, Spark, Python, Docker and Superset. The project will ingest, process, aggregate and visualize streaming data from different sources and provide insights for business intelligence and decision making.
Technologies
The project will use the following technologies:
- Druid: A high-performance, real-time analytics database that supports sub-second queries on streaming and batch data at scale1.
- Kafka: A distributed streaming platform that enables reliable and scalable data ingestion and processing.
- Spark: A unified analytics engine that supports large-scale data processing and machine learning.
- Python: A popular programming language that offers a rich set of libraries and frameworks for data analysis and manipulation4.
- Docker: A platform that enables building, running and sharing applications using containers.
- Superset: An open-source business intelligence platform that allows users to create and explore interactive dashboards and visualizations on their data.
Notes
- I couldn't add airflow to project because my memory isn't sufficent for that. I have planned different project for usage of airflow.
Roadmap
- Create Kafka docker-compose
- Create Druid docker-compose for OLAP
- Create Spark docker-compose
- Create topic for Druid
- Produce Kafka with Dummy Variable
- Consume dummy data from Kafka
- Show dummy data on Superset
- Create Scraping script from live-api
- Set Kafka config to Druid
- Mount Druid and Kafka
- Create Airflow for schedule and monitor DAG
- Design architecture of live-stream data
- So on, The list will be adjusted for requirements on going
Architecture
Install
- If you are using macOS platform, you have to set env platform specific setting
export DOCKER_DEFAULT_PLATFORM=linux/amd64
Useful Scripts
Kafka
Some basic Kafka commands are:
Create Topic - after connect kafka bash in docker
./bin/kafka-topics.sh --bootstrap-server <topic_name>:9092 --create --topic <topic_name>
List Topics
./bin/kafka-topics.sh --bootstrap-server=localhost:9092 --list
Docker
Some basic Docker commands are:
Run docker-compose file
docker-compose <docker-compose-filename>
Connect docker bash
docker exec -it <container_name> bash
Superset
Important: You must add private network for all nodes in composes in order to connect superset from druid.
Firstly, you can pull files from the original repository and then run below code
Run docker-compose file
cd superset && docker-compose up -d
Hint: If you are using the other docker-compose simultaneously, you have to block port forward in superset docker-compose.
To connect druid, you should add pydruid to requirements.txt file
#pydruid connector
pydruid
Producer
Some basic Python commands are:
python <producer_script_name.py>