/Project-Two-Amplifire-Github

This repository holds our analysis of the CDC mortality data.

Primary Language: Scala

Project 2

Proposal:

Presentations

  • We will be analyzing the following questions:
  1. Average age of death from lightning
  2. Manner of death % for each year
  3. Marriage life expectancy throughout the years
  4. How many people died at work while doing non-work activities
  5. Deaths on the job compared to education level
  6. Deaths on the job compared to age
  7. Cumulative % of autopsied vs. non-autopsied deaths
  8. How many people died in the month of May while engaged in sports activities
  9. What are the main ways people die while doing vital activities?
  10. What are the main causes of death while resting, sitting, or eating?
  11. What's the impact of education on the life expectancy of an individual?
  12. Deadliest day of the week and deadliest month of the year?
  13. Deadliest day of the year?
  14. What are the main causes of death for infants (roughly the top 5 to 10)?
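
Several of these questions reduce to a per-group percentage. As a rough illustration of question 2 (manner of death % for each year), here is a minimal sketch in plain Scala collections rather than Spark, so that it runs without a cluster. The `DeathRecord` case class, its field names, and the `MannerOfDeathShare` object are hypothetical and are not taken from the project code or the data set's actual schema.

```scala
// Hypothetical record shape for illustration only; the real data set
// has many more columns and its column names may differ.
final case class DeathRecord(year: Int, mannerOfDeath: String)

object MannerOfDeathShare {
  // For each year, compute the percentage of that year's records
  // that fall under each manner of death.
  def percentByYear(records: Seq[DeathRecord]): Map[Int, Map[String, Double]] =
    records.groupBy(_.year).map { case (year, inYear) =>
      val total = inYear.size.toDouble
      val shares = inYear.groupBy(_.mannerOfDeath).map { case (manner, rows) =>
        manner -> 100.0 * rows.size / total
      }
      year -> shares
    }
}
```

In the Spark version the same idea would presumably become a group-by on year and manner of death, divided by a per-year total.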

Technologies

  • Apache Spark
  • Spark SQL
  • YARN
  • HDFS and/or S3
  • Scala
  • Git + GitHub

Due Date

  • Presentations will take place on Friday, 19-Nov-2021.

Instructions for Using the Tool

The data comes as CSV files from https://www.kaggle.com/cdc/mortality.

Download the individual CSV files by year and upload each one to /home/maria_dev. Then add each file name to a separate file called "csvFileList.csv", one name per line, with a newline ("\n") after every name, including the last. The newline serves as the delimiter. Upload csvFileList.csv to /home/maria_dev as well.
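
To make the expected layout of csvFileList.csv concrete, the sketch below shows one way such a list could be read back in Scala. It assumes the file holds one file name per line with "\n" as the delimiter, as described above. The `CsvFileList` object and its method names are hypothetical and are not taken from the project code.

```scala
import scala.io.Source

object CsvFileList {
  // Split the contents of csvFileList.csv on "\n", trim stray whitespace
  // (including "\r" from Windows line endings), and drop empty entries
  // such as the one after the trailing newline.
  def parseFileList(contents: String): List[String] =
    contents.split("\n").toList.map(_.trim).filter(_.nonEmpty)

  // Prefix each file name with the upload directory from the instructions.
  def toPaths(names: List[String], baseDir: String = "/home/maria_dev"): List[String] =
    names.map(name => s"$baseDir/$name")

  // Read the list file from disk and return the parsed file names.
  def readFileList(listPath: String): List[String] = {
    val source = Source.fromFile(listPath, "UTF-8")
    try parseFileList(source.mkString)
    finally source.close()
  }
}
```

For example, a list file containing `2005_data.csv` and `2006_data.csv` (illustrative names) on separate lines would yield `/home/maria_dev/2005_data.csv` and `/home/maria_dev/2006_data.csv` after `toPaths`.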