This project is a data engineering solution for AdvertiseX, designed to handle data ingestion, processing, storage, and monitoring of ad impression, click, and bid request data.
- ingestion/: Contains Kafka producer and consumer scripts.
- processing/: Contains Spark processing and Hudi writing scripts.
- config/: Contains configuration files.
- monitoring/: Contains Prometheus and Grafana configuration files.
- requirements.txt: Python dependencies.
- README.md: Project documentation.
- Start Kafka and create necessary topics.
- Run the Kafka producer.
- Run the Kafka consumer.
- Start Spark processing and Hudi writing jobs.
- Set up Prometheus and Grafana for monitoring.
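To make the producer step more concrete, here is a minimal sketch of how an ad impression event might be built and serialized before it is sent to Kafka. The field names (`ad_id`, `user_id`, `website`, `timestamp`) and the helpers `build_impression` and `serialize` are assumptions for illustration; they are not taken from `ingestion/kafka_producer.py`.

```python
import json
from datetime import datetime, timezone


def build_impression(ad_id, user_id, website):
    """Build a hypothetical ad impression event as a plain dict."""
    return {
        "ad_id": ad_id,
        "user_id": user_id,
        "website": website,
        # ISO 8601 UTC timestamp so downstream jobs can parse it uniformly
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def serialize(event):
    """Encode an event as UTF-8 JSON bytes, a common Kafka message format."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


if __name__ == "__main__":
    payload = serialize(build_impression("ad-1", "user-42", "example.com"))
    print(payload)
```

If the producer uses a client such as kafka-python, a function like `serialize` could be passed as the `value_serializer` argument of `KafkaProducer`, so that each event dict is encoded the same way.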
- Start Kafka and create topics:
bin/kafka-topics.sh --create --topic ad_impressions --bootstrap-server localhost:9092
bin/kafka-topics.sh --create --topic ad_clicks --bootstrap-server localhost:9092
bin/kafka-topics.sh --create --topic bid_requests --bootstrap-server localhost:9092
- Run Kafka producer:
python ingestion/kafka_producer.py
- Run Kafka consumer:
python ingestion/kafka_consumer.py
- Start Spark processing jobs:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.apache.hudi:hudi-spark-bundle_2.12:0.9.0 processing/spark_processor.py
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org
- Configure and start Prometheus and Grafana.
Prometheus will scrape metrics from Kafka and Spark, and Grafana will visualize them. Import the provided Grafana dashboard JSON to set up the visualizations.
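The topic-creation step above can also be scripted rather than typed by hand. The sketch below builds one `kafka-topics.sh` command per topic, using the same topic names and broker address as the commands above. The helper names `topic_command` and `create_topics`, the `dry_run` flag, and the assumption that the script runs from the Kafka installation directory are illustrative and not part of this repository.

```python
import subprocess

# Topic names and broker address as used in the setup steps
TOPICS = ["ad_impressions", "ad_clicks", "bid_requests"]
BOOTSTRAP_SERVER = "localhost:9092"


def topic_command(topic, bootstrap_server=BOOTSTRAP_SERVER):
    """Return the kafka-topics.sh argument list that creates one topic."""
    return [
        "bin/kafka-topics.sh",
        "--create",
        "--topic", topic,
        "--bootstrap-server", bootstrap_server,
    ]


def create_topics(topics=TOPICS, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for topic in topics:
        cmd = topic_command(topic)
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if the command fails
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    create_topics()
```

By default the script only prints the commands; calling `create_topics(dry_run=False)` runs them against the broker.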