Examining the Relationship Between Tree Quality and Socioeconomic Status in New York City

Description of the Project

This project explores the relationship between tree quality and household income in New York City. It uses two datasets: one containing the health, status, and trunk diameter of trees in the city, and another containing median household income across different areas of NYC. By analyzing these data, the project examines whether tree quality correlates with household income across the city. We believe the results can inform policymakers and urban planners working to create more equitable and sustainable urban environments in NYC.

Prerequisites

Upload opencsv to NYU's Dataproc and HDFS

Raw Data

Download tree data here
Download median income data here (select all Income)
Get city-zips here
OR use the csv files saved in ./raw_data

Data Cleaning

Upload all files in ./raw_data to NYU's Dataproc and HDFS
lc4181
- Run Clean with the following command
  sh install.sh
- Proof of the successful run is saved as output*.png
cz1906
- Run Clean with the following command
  sh install.sh
- Proof of the successful run is saved as Proof_*.png

Data Profiling

lc4181
- Upload all ./profiling_code/lc4181/*.csv files to NYU's Dataproc and HDFS
- Run Income with the following command
  sh install.sh
- Proof of the successful run is saved as output*.png
cz1906
- Upload the ./profiling_code/cz1906/tree_data_cleaned.csv file to NYU's Dataproc and HDFS
- Run DBH with the following command
  sh dbh_install.sh
- Run Health with the following command
  sh health_install.sh
- Proof of the successful runs is saved as Part1_Proof*.png
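To illustrate what the DBH and Health profiling steps summarize, here is a minimal local sketch in plain Python. It is not the project's profiling code, which runs on Dataproc via the install scripts above. The column names `tree_dbh` and `health`, the `profile` helper, and the sample rows are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter
from statistics import mean

# Hypothetical sample rows. The column names "tree_dbh" and "health"
# are assumptions, not confirmed against tree_data_cleaned.csv.
SAMPLE = """tree_dbh,health
3,Good
21,Fair
11,Good
0,Poor
"""


def profile(csv_text):
    """Return health-category counts and basic DBH statistics."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    health_counts = Counter(row["health"] for row in rows)
    dbh_values = [int(row["tree_dbh"]) for row in rows]
    return {
        "health_counts": dict(health_counts),
        "dbh_min": min(dbh_values),
        "dbh_max": max(dbh_values),
        "dbh_mean": mean(dbh_values),
    }


summary = profile(SAMPLE)
print(summary)
```

Running a quick check like this on a small extract can help confirm that the cleaned file has the expected categories and value ranges before submitting the full job.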

Data Ingestion

Upload all *.csv in ./data_ingest to NYU's Dataproc and HDFS
Run ./data_ingest/data_ingest.py on NYU's Dataproc
spark-submit data_ingest.py
View results on HDFS
hd -cat output/part-00000-id.csv
The results are saved here, and proof of the successful run is saved as output*.png
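As a rough picture of what combining the two datasets involves, the sketch below joins tree records to income records by ZIP code and averages a tree health score per ZIP. It uses plain Python rather than Spark and is not the contents of data_ingest.py. The column names (`postcode`, `zipcode`, `median_income`), the ordinal `HEALTH_SCORE` mapping, the `join_by_zip` helper, and the sample values are all assumptions for illustration only.

```python
import csv
import io
from collections import defaultdict

# Made-up illustrative rows; column names are assumptions.
TREES = """postcode,health
10003,Good
10003,Poor
11201,Good
"""

INCOME = """zipcode,median_income
10003,100000
11201,120000
"""

# Assumed ordinal encoding of the health categories.
HEALTH_SCORE = {"Poor": 1, "Fair": 2, "Good": 3}


def join_by_zip(trees_text, income_text):
    """Map each ZIP code to (mean health score, median income)."""
    income = {
        row["zipcode"]: float(row["median_income"])
        for row in csv.DictReader(io.StringIO(income_text))
    }
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(trees_text)):
        # Keep only trees with a known health value and a matching ZIP.
        if row["health"] in HEALTH_SCORE and row["postcode"] in income:
            scores[row["postcode"]].append(HEALTH_SCORE[row["health"]])
    return {
        zip_code: (sum(vals) / len(vals), income[zip_code])
        for zip_code, vals in scores.items()
    }


joined = join_by_zip(TREES, INCOME)
print(joined)
```

This is an inner join: ZIP codes that appear in only one of the two inputs are dropped, which may or may not match the behavior of the actual ingestion script.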

Analysis

Upload all *.csv in ./ana_code to NYU's Dataproc and HDFS
Run ./ana_code/analysis.py on NYU's Dataproc
spark-submit analysis.py
View YARN logs for results
yarn logs -applicationId <application ID>
The results are saved here, and proof of the successful run is saved here and here
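Since the project asks whether tree quality correlates with household income, one common way to quantify that is the Pearson correlation coefficient. The sketch below is a minimal plain-Python version under that assumption. It is not taken from analysis.py, which may use a different statistic or Spark's own functions, and the `pearson` helper is a hypothetical name.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Perfectly linear toy inputs, used only to sanity-check the function.
r_up = pearson([1, 2, 3], [2, 4, 6])
r_down = pearson([1, 2, 3], [3, 2, 1])
print(r_up, r_down)
```

In practice the inputs would be per-area values, for example a mean tree health score and the median household income for each ZIP code. A coefficient near 0 would suggest no linear relationship, and correlation alone would not establish causation.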

Results

Our results can be found here.

Paper

For more details on this project, see the full paper here.

Collaborators

Contributors to this project: Charlie Cai (lc4181@nyu.edu), Stephen Zhang (stephen.zhang@nyu.edu)