This project uses the Spark engine to analyze data from the GitHub repository on COVID-19 vaccinations in Italy (covid19-opendata-vaccini) in order to answer the following queries:
Query1 - mean number of vaccinations for a generic center for each region and each month
Query2 - for each month and age, the top five regions predicted to have the highest number of women vaccinated on the first day of that month
Query3 - for each region, predict the total number of vaccinations on the first day of June and cluster the regions using K-means or Bisecting K-means
The code for each of these queries is in src/main/java/queries. Alternative implementations of the second and third queries using SparkSQL are in the folder src/main/java/sql_queries.
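To make the aggregation behind Query1 concrete, here is a minimal sketch in plain Java, without Spark. It is not the code in src/main/java/queries: the class `Query1Sketch`, the row type `DailyDoses`, the `region|month` key format, and the reading of "mean for a generic center" as total doses in a region and month divided by the number of centers in that region are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class Query1Sketch {

    /** Hypothetical input row: doses administered in a region, tagged with its month. */
    public static final class DailyDoses {
        public final String region;
        public final String month; // e.g. "2021-01"
        public final long doses;

        public DailyDoses(String region, String month, long doses) {
            this.region = region;
            this.month = month;
            this.doses = doses;
        }
    }

    /**
     * Sums doses per (region, month), then divides each total by the number
     * of centers in that region. Keys have the illustrative form "region|month".
     */
    public static Map<String, Double> meanPerCenter(List<DailyDoses> rows,
                                                    Map<String, Integer> centersPerRegion) {
        // Step 1: total doses for each (region, month) pair.
        Map<String, Long> totals = new TreeMap<>();
        for (DailyDoses r : rows) {
            totals.merge(r.region + "|" + r.month, r.doses, Long::sum);
        }

        // Step 2: divide each total by the number of centers in the region.
        Map<String, Double> means = new TreeMap<>();
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            String region = e.getKey().substring(0, e.getKey().indexOf('|'));
            Integer centers = centersPerRegion.get(region);
            if (centers == null || centers == 0) {
                continue; // skip regions with no known centers
            }
            means.put(e.getKey(), e.getValue() / (double) centers);
        }
        return means;
    }

    public static void main(String[] args) {
        // Made-up numbers, used only to exercise the method.
        List<DailyDoses> rows = Arrays.asList(
            new DailyDoses("Lazio", "2021-01", 100),
            new DailyDoses("Lazio", "2021-01", 300),
            new DailyDoses("Lazio", "2021-02", 50));
        Map<String, Integer> centers = new HashMap<>();
        centers.put("Lazio", 4);

        System.out.println(meanPerCenter(rows, centers));
    }
}
```

In a Spark job the same two steps would typically map onto a keyed reduction followed by a join with the per-region center counts.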
This project uses Docker and docker-compose to instantiate the HDFS, Spark, NiFi and Redis containers.
Worker nodes for both Spark and HDFS can be scaled as needed using docker-compose:
docker-compose up --scale spark-worker=3 --scale datanode=4
This example starts a cluster with 3 Spark workers and 4 HDFS datanodes.
On the first deployment of the cluster you can import the NiFi templates saved in the folder /nifi/templates:
- input.xml - takes data from the GitHub repository and loads it into HDFS
- redis.xml - takes data from HDFS and puts it in Redis
Create the jar needed for the submission of the query:
mvn package
To submit a query to the Spark cluster, use the scripts in the /scripts folder at the root of the project:
sh submit_query.sh 1
sh submit_query.sh 3 0 4
The first parameter specifies which query to submit. The remaining parameters are needed only for query 3 and SQL query 3: the second selects the clustering algorithm (0 for K-means, 1 for Bisecting K-means) and the third sets the number of clusters.
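The argument convention described above can be sketched as a small validator in Java. This is an illustration only, not the project's actual entry point: the class `SubmitArgsSketch`, the `QueryArgs` holder and the `Algorithm` enum are hypothetical names, and how the SQL variants are selected is not covered because it is not described here.

```java
public class SubmitArgsSketch {

    /** Clustering algorithm selected by the second parameter (0 or 1). */
    public enum Algorithm { K_MEANS, BISECTING_K_MEANS }

    /** Hypothetical holder for the parsed parameters. */
    public static final class QueryArgs {
        public final int query;
        public final Algorithm algorithm; // null unless query == 3
        public final int clusters;        // 0 unless query == 3

        QueryArgs(int query, Algorithm algorithm, int clusters) {
            this.query = query;
            this.algorithm = algorithm;
            this.clusters = clusters;
        }
    }

    /** Parses "<query> [<algorithm> <clusters>]" and rejects invalid combinations. */
    public static QueryArgs parse(String[] args) {
        if (args.length < 1) {
            throw new IllegalArgumentException("usage: <query> [<algorithm> <clusters>]");
        }
        int query = Integer.parseInt(args[0]);
        if (query < 1 || query > 3) {
            throw new IllegalArgumentException("query must be 1, 2 or 3");
        }
        if (query != 3) {
            // Queries 1 and 2 take no further parameters.
            return new QueryArgs(query, null, 0);
        }
        if (args.length < 3) {
            throw new IllegalArgumentException("query 3 needs <algorithm> and <clusters>");
        }
        int algorithmCode = Integer.parseInt(args[1]);
        if (algorithmCode != 0 && algorithmCode != 1) {
            throw new IllegalArgumentException("algorithm must be 0 (k-means) or 1 (bisecting k-means)");
        }
        int clusters = Integer.parseInt(args[2]);
        if (clusters < 1) {
            throw new IllegalArgumentException("number of clusters must be positive");
        }
        Algorithm algorithm = algorithmCode == 0 ? Algorithm.K_MEANS : Algorithm.BISECTING_K_MEANS;
        return new QueryArgs(query, algorithm, clusters);
    }

    public static void main(String[] args) {
        // Mirrors the parameters "3 0 4": query 3, K-means, 4 clusters.
        QueryArgs parsed = parse(new String[] {"3", "0", "4"});
        System.out.println(parsed.query + " " + parsed.algorithm + " " + parsed.clusters);
    }
}
```

Validating the parameters before submission avoids starting a Spark job that would fail only after the cluster has been allocated.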
- http://localhost:9870 HDFS namenode
- http://localhost:8080 Spark master
- http://localhost:4040 Spark application
- http://localhost:9090/nifi NiFi
You can visualize the collected data through the Grafana dashboard using the following link (it shows the ranking for each region and the trend for each category, using only the data from Lazio):