This project builds a real-time e-commerce data streaming pipeline using Apache Kafka, Apache Druid, Apache Airflow, and MinIO, all deployed via Docker. The system captures real-time data streams from an e-commerce platform and stores and processes them in Druid for fast querying and analysis, while using MinIO as deep storage for long-term data retention. Apache Superset visualizes the data, providing real-time insights and dashboards for better decision-making. This setup ensures scalability, performance, and efficiency in handling and analyzing large volumes of streaming data.
This project is built with Docker Compose, so make sure you have Docker and Docker Compose installed; follow the steps in the Docker documentation to install them. Then pull this repo and start the journey.

NOTE: Data is collected from Tiki, a Vietnamese e-commerce platform.

```bash
cd StreamingEcommerceDE
```
- For the first run, start the MinIO service on its own to initialize the deep storage that will hold the streaming data. Change the two parameters `MINIO_ROOT_USER` and `MINIO_ROOT_PASSWORD` yourself in `docker-compose.yml`, under the `minio` service block (a sketch of that block follows). Then run:

  ```bash
  docker-compose up -d minio
  ```
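  A minimal sketch of what the `minio` service block might look like; the image, command, ports, and volume name here are assumptions based on a typical MinIO Compose setup, and only the two root credentials are the parameters you must change:

  ```yaml
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: your_user              # change this
      MINIO_ROOT_PASSWORD: your_password      # change this (at least 8 characters)
    ports:
      - "9000:9000"   # S3 API, used by Druid as deep storage
      - "9001:9001"   # web console
    volumes:
      - minio_data:/data
  ```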
- Follow this step by step to run Druid with MinIO as deep storage. The MinIO server is now running at `localhost:9001`; log in, create a `druidbucket` to store segments and indexing logs, create a service account (under the Identity menu), edit the user policy to allow access only to `druidbucket`, and use the service account's `access key` and `secret key` in the Druid configuration (`./druid/environment`), sketched below.
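  For orientation, the deep-storage part of `./druid/environment` typically ends up looking something like the sketch below. The property names are standard Druid S3 settings in env-var form; the endpoint and the placeholder keys are assumptions you must replace with your own values:

  ```properties
  druid_storage_type=s3
  druid_storage_bucket=druidbucket
  druid_storage_baseKey=segments
  druid_s3_accessKey=<service-account-access-key>
  druid_s3_secretKey=<service-account-secret-key>
  druid_s3_endpoint_url=http://minio:9000
  druid_s3_enablePathStyleAccess=true
  druid_indexer_logs_type=s3
  druid_indexer_logs_s3Bucket=druidbucket
  druid_indexer_logs_s3Prefix=indexing_logs
  ```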
- Then, run all the services with:

  ```bash
  docker-compose up -d
  ```
- The `USER` and `PASSWORD` for some services are configured in `docker-compose.yml`; Apache Airflow's admin password is provided in `airflow/standalone_admin_password.txt`.
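  For example, to read the generated Airflow password (the login user for an `airflow standalone` setup is typically `admin`):

  ```bash
  cat airflow/standalone_admin_password.txt
  ```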
| Service | URL |
| --- | --- |
| MinIO | http://localhost:9001 |
| Apache Druid | http://localhost:8888 |
| Apache Superset | http://localhost:8088 |
| Apache Airflow | http://localhost:8080 |
- The file `KafkaProducerEcomm.py` sends a demo message with fake transaction data to the Kafka `Ecommerce` topic every second. The structure of a data message is as below, followed by a sketch of the producer loop:
  ```python
  {
      'id': 274992707,
      'name': 'Hộp Cơm 3 Tầng Lunch Box Kèm Muỗng Đũa Và Túi Đựng Tiện Lợi',
      'brand_name': 'PEAFLO',
      'price': 192000,
      'Origin': 'Hàn Quốc / Trung Quốc',
      'category': 'Dụng cụ chứa đựng thực phẩm',
      'original_price': 235000,
      'discount': 43000,
      'discount_rate': 18,
      'purchase_id': 'P17281500820277577',
      'quantity': 5,
      'city': 'Quảng Ninh',
      'code_city': 'VN-13',
      'create_at': '2024-10-06 00:41:22'
  }
  ```
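  A minimal sketch of such a producer, assuming `kafka-python` is installed and the broker is reachable at `localhost:9092` from the host (inside the Compose network it would be `kafka:9092`); see `KafkaProducerEcomm.py` for the real generator:

  ```python
  import json
  import random
  import time
  from datetime import datetime

  from kafka import KafkaProducer  # pip install kafka-python

  producer = KafkaProducer(
      bootstrap_servers="localhost:9092",
      # Serialize dicts as UTF-8 JSON, keeping non-ASCII product names intact.
      value_serializer=lambda v: json.dumps(v, ensure_ascii=False).encode("utf-8"),
  )

  while True:
      # One fake transaction per second, matching the structure above.
      message = {
          "id": random.randint(10**8, 10**9),
          "name": "Hộp Cơm 3 Tầng Lunch Box Kèm Muỗng Đũa Và Túi Đựng Tiện Lợi",
          "brand_name": "PEAFLO",
          "price": 192000,
          "Origin": "Hàn Quốc / Trung Quốc",
          "category": "Dụng cụ chứa đựng thực phẩm",
          "original_price": 235000,
          "discount": 43000,
          "discount_rate": 18,
          "purchase_id": f"P{random.randint(10**16, 10**17 - 1)}",
          "quantity": random.randint(1, 10),
          "city": "Quảng Ninh",
          "code_city": "VN-13",
          "create_at": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
      }
      producer.send("Ecommerce", message)
      time.sleep(1)
  ```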
- From Druid, load data from Kafka at `kafka:9092`, choose the `Ecommerce` topic, and configure the resulting data table (or submit a supervisor spec directly, as sketched below).
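  If you prefer to script the ingestion instead of clicking through the Druid console, you can POST a Kafka supervisor spec to the Router, which proxies it to the Overlord. A sketch only; the granularities and schema-discovery setting are assumptions based on the message structure above:

  ```python
  import json

  import requests

  # Minimal Kafka ingestion (supervisor) spec for the Ecommerce topic.
  spec = {
      "type": "kafka",
      "spec": {
          "dataSchema": {
              "dataSource": "Ecommerce",
              "timestampSpec": {"column": "create_at", "format": "auto"},
              "dimensionsSpec": {"useSchemaDiscovery": True},
              "granularitySpec": {"queryGranularity": "none", "segmentGranularity": "hour"},
          },
          "ioConfig": {
              "topic": "Ecommerce",
              "inputFormat": {"type": "json"},
              "consumerProperties": {"bootstrap.servers": "kafka:9092"},
              "useEarliestOffset": True,
          },
      },
  }

  resp = requests.post(
      "http://localhost:8888/druid/indexer/v1/supervisor",
      data=json.dumps(spec),
      headers={"Content-Type": "application/json"},
  )
  resp.raise_for_status()
  print(resp.json())  # {"id": "Ecommerce"} on success
  ```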
- For more information, see the project's GitHub page; for details on configuring the data ingestion process, see Druid's Ingestion overview.
- From the Superset server, add a Druid database with the SQLAlchemy URI `druid://broker:8082/druid/v2/sql/` (a quick connectivity check is sketched below).
  - More detail at Connecting to Databases.
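  To sanity-check that URI outside Superset, you can open it with SQLAlchemy directly. A sketch assuming `pydruid` is installed (it provides the `druid://` dialect) and that the script runs on the Compose network, where the hostname `broker` resolves:

  ```python
  from sqlalchemy import create_engine, text

  # The same URI Superset will use.
  engine = create_engine("druid://broker:8082/druid/v2/sql/")

  with engine.connect() as conn:
      # Count the rows ingested so far into the Ecommerce datasource.
      count = conn.execute(text('SELECT COUNT(*) FROM "Ecommerce"')).scalar()
      print(f"Ecommerce rows: {count}")
  ```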
- Create dashboards with amazing charts from the `Ecommerce` table.
🔥🔥🔥 🤝🤝🤝 🔥🔥🔥