Proposal:
- Create a Spark application that processes data from Kaggle at this link:
- https://www.kaggle.com/cdc/mortality
- We will be analyzing the following questions:
  - Average age of death from lightning
  - Manner of death % for each year
  - Marriage life expectancy throughout the years
  - How many people died at work while doing non-work activities
  - Deaths on the job compared to education level
  - Deaths on the job compared to age
  - Cumulative % of autopsied vs. non-autopsied deaths
  - How many people died in the month of May while engaged in sports activities
  - What are the main ways people die while doing vital activities?
  - What are the main causes of death while resting, sitting, or eating?
  - What is the impact of education on the life expectancy of an individual?
  - What are the deadliest day of the week and the deadliest month of the year?
  - What is the deadliest day of the year?
  - What are the main causes of death for infants (top 5 to 10)?
- Technologies:
  - Apache Spark
  - Spark SQL
  - YARN
  - HDFS and/or S3
  - Scala
  - Git + GitHub
- Presentations will take place on Friday, 19-Nov-2021.
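
Most of the questions listed above reduce to a grouped count or average in Spark SQL. The sketch below shows the general shape for the "deadliest day of the week" question. It is only an illustration: the column names `month_of_death` and `day_of_week_of_death` are assumptions about the CSV headers and should be checked against the actual files, the rows are made-up sample values rather than real data, and the object and function names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DeadliestDaySketch {
  // Count deaths per day-of-week code, most frequent first.
  // Assumes the DataFrame has a column named day_of_week_of_death.
  def deathsByDay(spark: SparkSession, deaths: DataFrame): DataFrame = {
    deaths.createOrReplaceTempView("deaths")
    spark.sql(
      """SELECT day_of_week_of_death, COUNT(*) AS death_count
        |FROM deaths
        |GROUP BY day_of_week_of_death
        |ORDER BY death_count DESC""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DeadliestDaySketch")
      .master("local[*]") // local mode for this sketch only; omit when submitting to YARN
      .getOrCreate()
    import spark.implicits._

    // Made-up rows standing in for real records: (month, day-of-week code).
    val sample = Seq((5, 1), (5, 1), (5, 3), (12, 1), (12, 7))
      .toDF("month_of_death", "day_of_week_of_death")

    deathsByDay(spark, sample).show()
    spark.stop()
  }
}
```

The same pattern should carry over to the other questions by changing the grouping column and adding a `WHERE` clause, for example filtering on the month before counting.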
As previously stated, data is collected in the form of CSV files from https://www.kaggle.com/cdc/mortality.
Download the individual CSV files, one per year, and upload each of them to /home/maria_dev. Then add each file name to a separate CSV file called "csvFileList.csv", following each name with a newline ("\n"), which serves as the delimiter between entries. Upload csvFileList.csv to /home/maria_dev as well.
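
As a rough sketch of how the application might consume this setup, the code below reads csvFileList.csv, splits it on the "\n" delimiter, and loads every listed file into a single DataFrame. The object name, the helper `parseFileList`, and the `header`/`inferSchema` options are illustrative assumptions, not the project's actual code. It also assumes the files sit on the local filesystem of a single-node setup (hence the `file://` prefix); paths on HDFS or S3 would need a different prefix.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.io.Source

object LoadMortalityCsvs {
  // Split the list file's contents on the "\n" delimiter and drop blank entries
  // (trim also removes a stray "\r" if the file was edited on Windows).
  def parseFileList(contents: String): Seq[String] =
    contents.split("\n").map(_.trim).filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // No master is set here; it is expected to come from spark-submit.
    val spark = SparkSession.builder().appName("LoadMortalityCsvs").getOrCreate()

    val baseDir = "/home/maria_dev"
    val source = Source.fromFile(s"$baseDir/csvFileList.csv")
    val fileNames = try parseFileList(source.mkString) finally source.close()

    // Build one local-filesystem URI per listed file.
    val paths = fileNames.map(name => s"file://$baseDir/$name")

    // Assumption: each CSV has a header row; inferSchema costs an extra pass over the data.
    val deaths: DataFrame = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(paths: _*)

    deaths.printSchema()
    spark.stop()
  }
}
```

Keeping the list of inputs in csvFileList.csv means another year can be added by appending one line to that file, with no code change.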