Project2

Project Description

This is a group project on COVID-19 data analysis using Spark DataFrames and RDDs. Each member worked on two queries. For mine, I ranked US states by the ratio of COVID-19 deaths to population and picked the 10 best (lowest ratio) and 10 worst (highest ratio). I also computed COVID-19 deaths as a share of the total number of deaths in each state. The output data is exported as CSV files for use in visualization.
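
The two ratios and the best/worst ranking can be sketched on plain Scala collections, independent of Spark. This is only an illustration of the arithmetic, not the project's actual query code: the `StateStats` record, its field names, and the `DeathRatios` helper are all hypothetical, and the real input columns may be named differently.

```scala
// Hypothetical per-state record; field names are assumptions for illustration,
// not the column names of the project's input CSV files.
final case class StateStats(
  state: String,
  covidDeaths: Long,
  population: Long,
  totalDeaths: Long
)

object DeathRatios {
  // Ratio of COVID-19 deaths to population for one state.
  def deathsPerCapita(s: StateStats): Double =
    s.covidDeaths.toDouble / s.population

  // COVID-19 deaths as a fraction of all deaths in the state.
  def covidShareOfDeaths(s: StateStats): Double =
    s.covidDeaths.toDouble / s.totalDeaths

  // Rows with a non-positive population would make the ratio meaningless,
  // so they are dropped before ranking.
  private def ranked(stats: Seq[StateStats]): Seq[StateStats] =
    stats.filter(_.population > 0).sortBy(deathsPerCapita)

  // The n states with the lowest deaths-to-population ratio.
  def best(stats: Seq[StateStats], n: Int): Seq[StateStats] =
    ranked(stats).take(n)

  // The n states with the highest deaths-to-population ratio, highest first.
  def worst(stats: Seq[StateStats], n: Int): Seq[StateStats] =
    ranked(stats).reverse.take(n)
}
```

Calling `DeathRatios.best(stats, 10)` and `DeathRatios.worst(stats, 10)` on a sequence of per-state records would give the two ranked lists. In Spark the same idea would typically be a derived ratio column followed by an ascending or descending sort and a limit.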

Technologies Used

  • Apache Spark & Spark SQL
  • HDFS and YARN
  • SBT
  • Scala 2.12.10

Input Data Files Used

  • Time_series_covid_19_deaths_US.csv
  • usDeath2020.csv

Output Files

  • deathBYcovid.csv
  • bestStates.csv
  • worstStates.csv