
Examining the Relationship Between Tree Quality and Socioeconomic Status in New York City


This is a class project for New York University's Processing Big Data for Analytics Applications course, Spring 2023. All rights reserved.

Description of the Project

This project explores the relationship between the quality of trees and household income in New York City. The application uses two datasets: one containing information on the health, status, and trunk diameter of trees in the city, and another with the median household income across different areas of NYC. By analyzing the data, the project determines whether tree quality correlates with household income across the city. We believe the results can help policymakers and urban planners create more equitable and sustainable urban environments in NYC.


  • Upload opencsv to NYU's Dataproc and HDFS

Raw Data

  1. Download tree data here
  2. Download median income data here, and select all Income
  3. Get city-zips here
  4. OR use the csv files saved in ./raw_data

Data Cleaning

  • Upload all files in ./raw_data to NYU's Dataproc and HDFS
  • lc4181
    • Run Clean with the following command
      sh install.sh
    • Proof of a successful run is saved as output*.png
  • cz1906
    • Run Clean with the following command
      sh install.sh
    • Proof of a successful run is saved as Proof_*.png
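
This README does not spell out what the Clean job does internally. As a rough illustration only, a cleaning pass over the tree data might keep the columns the analysis needs and drop incomplete rows. The sketch below uses plain Python rather than the project's actual code, and the column names (`zipcode`, `health`, `status`, `dbh`) are assumptions, not the real dataset headers.

```python
import csv
import io

# Hypothetical column names; the real dataset headers may differ.
KEEP = ("zipcode", "health", "status", "dbh")


def clean_rows(lines):
    """Yield dicts holding only the KEEP columns, skipping incomplete rows."""
    reader = csv.DictReader(lines)
    for row in reader:
        values = {key: (row.get(key) or "").strip() for key in KEEP}
        # Drop the row if any required field is missing or blank.
        if all(values.values()):
            yield values


# Toy input for illustration; not taken from the real data.
sample = io.StringIO(
    "zipcode,health,status,dbh,extra\n"
    "10003,Good,Alive,12,x\n"
    "10003,,Dead,0,y\n"
)
cleaned = list(clean_rows(sample))
```

Here the second toy row is dropped because its `health` field is blank.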

Data Profiling

  • lc4181
    • Upload all ./profiling_code/lc4181/*.csv files to NYU's Dataproc and HDFS
    • Run Income with the following command
      sh install.sh
    • Proof of a successful run is saved as output*.png
  • cz1906
    • Upload the ./profiling_code/cz1906/tree_data_cleaned.csv file to NYU's Dataproc and HDFS
    • Run DBH with the following command
      sh dbh_install.sh
    • Run Health with the following command
      sh health_install.sh
    • Proof of a successful run is saved as Part1_Proof*.png
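
To give a sense of what profiling trunk diameter (DBH) and health could look like, here is a minimal sketch in plain Python. It is not the project's profiling code; the field names `health` and `dbh` and the toy rows are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def profile(rows):
    """Return health category counts and (min, max, mean) of DBH."""
    health_counts = Counter(row["health"] for row in rows)
    dbh_values = [float(row["dbh"]) for row in rows]
    return health_counts, (min(dbh_values), max(dbh_values), mean(dbh_values))


# Toy rows for illustration; not taken from the real data.
rows = [
    {"health": "Good", "dbh": "12"},
    {"health": "Fair", "dbh": "4"},
    {"health": "Good", "dbh": "8"},
]
health_counts, dbh_summary = profile(rows)
```

A summary like this helps spot unexpected categories or implausible diameters before the data is joined with income.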

Data Ingestion

  • Upload all *.csv in ./data_ingest to NYU's Dataproc and HDFS
  • Run ./data_ingest/data_ingest.py on NYU's Dataproc
    spark-submit data_ingest.py
  • View results on HDFS
    hd -cat output/part-00000-id.csv
  • The results are saved here, and proof of a successful run is saved as output*.png
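
The README does not describe what data_ingest.py produces. Since the project pairs tree records with income by area, one plausible step is a join on zip code. The sketch below shows that idea in plain Python rather than Spark; the function name, the `zipcode` and `median_income` fields, and the toy values are all hypothetical.

```python
def join_by_zip(tree_rows, income_by_zip):
    """Attach median income to each tree row that has a matching zip code."""
    joined = []
    for row in tree_rows:
        income = income_by_zip.get(row["zipcode"])
        # Inner-join behavior: rows without an income entry are dropped.
        if income is not None:
            joined.append({**row, "median_income": income})
    return joined


# Toy inputs for illustration; not taken from the real data.
trees = [
    {"zipcode": "10003", "health": "Good"},
    {"zipcode": "99999", "health": "Poor"},
]
income = {"10003": 100000}
joined = join_by_zip(trees, income)
```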

Data Analysis

  • Upload all *.csv in ./ana_code to NYU's Dataproc and HDFS
  • Run ./ana_code/analysis.py on NYU's Dataproc
    spark-submit analysis.py
  • View YARN logs for results
    yarn logs -applicationId <application ID>
  • The results are saved here, and proof of a successful run is saved here and here
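
The project's stated goal is to check for a correlation between tree quality and household income. As a minimal sketch of one way to measure that, the code below computes a Pearson correlation coefficient in plain Python. This is not necessarily the method analysis.py uses; the `HEALTH_SCORE` mapping from health labels to numbers is an assumption introduced here for illustration.

```python
from math import sqrt

# Assumed numeric encoding of health labels; not confirmed by the project.
HEALTH_SCORE = {"Poor": 1, "Fair": 2, "Good": 3}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

A caller would pass, for example, the average health score per zip code as `xs` and the median income per zip code as `ys`. A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little linear relationship.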



  • For more details on this project, see the full paper here.


Contributors to this project: Charlie Cai (lc4181@nyu.edu), Stephen Zhang (stephen.zhang@nyu.edu)