Real-Time Bike Station Availability Monitoring is a system that tracks and analyzes the status of cycling stations using the JCDecaux API. Its data pipeline combines Kafka, Spark, Elasticsearch, and Kibana to process and visualize station data in real time:
- Kafka: Ingests real-time data from cycling stations.
- Spark: Processes and analyzes the data.
- Elasticsearch: Stores and indexes the processed data.
- Kibana: Visualizes insights on an interactive dashboard.
Each service runs in its own Docker container.
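To make the data flow more concrete, here is a minimal plain-Python sketch of the kind of per-record transformation the Spark stage might apply before indexing into Elasticsearch. It is not the project's `pyspark_consumer.py`. The input field names follow the JCDecaux v1 station format as an assumption, and the output field names (`station_id`, `occupancy`, and so on) are hypothetical.

```python
import json
from datetime import datetime, timezone


def to_document(raw: bytes) -> dict:
    """Turn one raw station message (JSON bytes) into an indexable document.

    Assumption: the message carries JCDecaux v1 station fields such as
    `number`, `contract_name`, `position`, `bike_stands`, `available_bikes`
    and `last_update` (epoch milliseconds).
    """
    station = json.loads(raw)
    stands = station["bike_stands"]
    bikes = station["available_bikes"]
    return {
        # Hypothetical composite key: contract name plus station number.
        "station_id": f'{station["contract_name"]}-{station["number"]}',
        "name": station["name"],
        # Elasticsearch geo_point objects use "lat"/"lon" keys.
        "location": {
            "lat": station["position"]["lat"],
            "lon": station["position"]["lng"],
        },
        "available_bikes": bikes,
        "available_bike_stands": station["available_bike_stands"],
        # Share of stands currently holding a bike; None if the station has no stands.
        "occupancy": bikes / stands if stands else None,
        # Convert epoch milliseconds to an ISO 8601 UTC timestamp.
        "last_update": datetime.fromtimestamp(
            station["last_update"] / 1000, tz=timezone.utc
        ).isoformat(),
    }
```

In the real pipeline the same logic would be expressed with Spark DataFrame operations; the function above only illustrates the shape of the data at each end.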
- Create an account at JCDecaux Developer to get the access token (API key) for the station data.
- Create and activate a Conda environment.
- Run the following command to install the required packages:
pip install -r requirements.txt
- Build the Docker containers:
docker-compose build
- Start the Docker containers in detached mode:
docker-compose up -d
- Copy the pyspark_consumer.py script into the Spark container:
docker cp -L ./pyspark_consumer.py real-time-cycling-station-monitor-project-spark-master-1:/opt/bitnami/spark/pyspark_consumer.py
- Execute the Spark job to process data:
docker-compose exec spark-master spark-submit --class consumer --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.4,org.elasticsearch:elasticsearch-spark-30_2.12:8.8.2,commons-httpclient:commons-httpclient:3.1 ./pyspark_consumer.py
- Run the Kafka producer:
python ./kafka_producer.py
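For readers curious about what the producer step involves, the following is a minimal sketch of one polling cycle: build the API request, fetch the station list, and hand each station to a sender as JSON bytes. It is not the project's `kafka_producer.py`. The endpoint URL is assumed to be the JCDecaux v1 stations endpoint, and the function names and the injectable `send` callable are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumption: JCDecaux v1 stations endpoint; check the developer portal for the current URL.
STATIONS_URL = "https://api.jcdecaux.com/vls/v1/stations"


def build_url(api_key, contract=None):
    """Build the request URL; `contract` optionally restricts results to one city."""
    params = {"apiKey": api_key}
    if contract:
        params["contract"] = contract
    return f"{STATIONS_URL}?{urlencode(params)}"


def fetch_stations(url, timeout=10):
    """Download and decode the JSON list of stations."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def encode_station(station):
    """Serialize one station record as compact UTF-8 JSON bytes."""
    return json.dumps(station, separators=(",", ":")).encode("utf-8")


def publish_once(send, api_key, contract=None, fetch=fetch_stations):
    """Fetch all stations once and pass each encoded record to `send`.

    `send` is any callable accepting bytes; `fetch` is injectable so the
    function can be exercised without network access.
    Returns the number of records sent.
    """
    stations = fetch(build_url(api_key, contract))
    for station in stations:
        send(encode_station(station))
    return len(stations)
```

With a Kafka client library, `send` would wrap the producer's send call for the chosen topic, and `publish_once` would run in a loop with a pause between polls. Keeping the sender injectable means the fetch and encode logic can be tested without a running broker.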
You can visualize the results in the interactive Kibana dashboard. Below is an example of the dashboard: