This is a class project for New York University's Processing Big Data for Analytics Applications course, Spring 2023. All rights reserved.
This project explores the relationship between the quality of trees and household income in New York City. The application uses two datasets: one containing information on the health, status, and trunk diameter of trees in the city, and another with data on the median income levels of households across different areas of NYC. By analyzing the data, the project determines whether there is a correlation between the quality of trees and household income across the city. We believe the results can help policymakers and urban planners create more equitable and sustainable urban environments in NYC.
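One common way to quantify such a relationship is Pearson's correlation coefficient. The sketch below is illustrative only: this README does not state which correlation measure the project uses, and the function name `pearson` and its inputs (one tree-quality score and one median income value per area) are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Return Pearson's r for two equal-length numeric sequences.

    xs: hypothetical per-area tree-quality scores.
    ys: hypothetical per-area median household incomes.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or no linear relationship.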
- Upload opencsv to NYU's Dataproc and HDFS
- Download tree data here
- Download median income data here, and select all Income
- Get city-zips here
- Alternatively, use the CSV files saved in ./raw_data
- Upload all files in ./raw_data to NYU's Dataproc and HDFS
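The HDFS half of the upload step can be sketched as follows, assuming the files have already been copied to the Dataproc login node and that the `hdfs` command is on the PATH. The helper names `build_put_commands` and `upload`, and the default HDFS target `.`, are assumptions for illustration and not part of this repository.

```python
import subprocess
from pathlib import Path


def build_put_commands(local_dir, hdfs_dir=".", pattern="*.csv"):
    """Return one `hdfs dfs -put -f` command per matching local file.

    A relative hdfs_dir such as "." resolves against the user's HDFS
    home directory; -f overwrites a file that already exists there.
    """
    files = sorted(Path(local_dir).glob(pattern))
    return [["hdfs", "dfs", "-put", "-f", str(path), hdfs_dir] for path in files]


def upload(local_dir, hdfs_dir=".", dry_run=True):
    """Print the commands (dry run) or execute them; return the command list."""
    commands = build_put_commands(local_dir, hdfs_dir)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if the upload fails.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `upload("./raw_data")` prints the commands without running them; pass `dry_run=False` on the cluster to perform the upload.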
- lc4181
- Run Clean with the following command
sh install.sh
- Proof of a successful run is saved as output*.png
- cz1906
- Run Clean with the following command
sh install.sh
- Proof of a successful run is saved as Proof_*.png
- lc4181
- Upload all ./profiling_code/lc4181/*.csv files to NYU's Dataproc and HDFS
- Run Income with the following command
sh install.sh
- Proof of a successful run is saved as output*.png
- cz1906
- Upload ./profiling_code/cz1906/tree_data_cleaned.csv to NYU's Dataproc and HDFS
- Run DBH with the following command
sh dbh_install.sh
- Run Health with the following command
sh health_install.sh
- Proof of a successful run is saved as Part1_Proof*.png
- Upload all *.csv in ./data_ingest to NYU's Dataproc and HDFS
- Run ./data_ingest/data_ingest.py on NYU's Dataproc
spark-submit data_ingest.py
- View results on HDFS
hd -cat output/part-00000-id.csv
- The results are saved here, and proof of a successful run is saved as output*.png
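This README does not describe what data_ingest.py does internally. As an illustration only, the following plain-Python sketch shows one way tree records could be aggregated per zip code and joined with income data. The field names `zipcode` and `health`, the health label `"Good"`, and both function names are hypothetical; the actual script runs on Spark and may work differently.

```python
from collections import defaultdict


def share_good_by_zip(tree_rows):
    """Return the fraction of trees rated "Good" for each zip code.

    tree_rows: iterable of dicts with hypothetical keys "zipcode" and "health".
    """
    total = defaultdict(int)
    good = defaultdict(int)
    for row in tree_rows:
        zipcode = row["zipcode"]
        total[zipcode] += 1
        if row["health"] == "Good":
            good[zipcode] += 1
    return {zipcode: good[zipcode] / total[zipcode] for zipcode in total}


def join_with_income(tree_share, income_by_zip):
    """Inner join on zip code: keep only zip codes present in both inputs."""
    return {
        zipcode: (tree_share[zipcode], income_by_zip[zipcode])
        for zipcode in tree_share
        if zipcode in income_by_zip
    }
```

An inner join is used here so that areas missing from either dataset are dropped rather than filled with placeholder values.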
- Upload all *.csv in ./ana_code to NYU's Dataproc and HDFS
- Run ./ana_code/analysis.py on NYU's Dataproc
spark-submit analysis.py
- View YARN logs for results
yarn logs -applicationId <application ID>
- The results are saved here, and proof of a successful run is here and here
- For more details on this project, see the full paper here.
Contributors to this project: Charlie Cai (lc4181@nyu.edu), Stephen Zhang (stephen.zhang@nyu.edu)