BigDataProject

Data Cleaning, Data Analysis and Data Exploration

Final Project Report - https://docs.google.com/document/d/10eXqXr4Je853OHkP8QWWdvTtgI-MkoKViFRyaKvKYQk/edit#

Google Doc Data Cleaning - https://docs.google.com/document/d/1SmbZb3nXtPsxA4_KVE0zSXGsQbahm659BlG7nZxJiVY/

Crime Dataset Link - https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i

Part 1 - Data Cleaning

Steps to Reproduce Results

  1. Download the dataset from the Crime Dataset Link provided above and rename the file to 'NYPD_Complaint_Data_Historic.csv'.

  2. Log in to the Hadoop cluster and create a directory named 'project'.

  3. Copy the csv file from your local machine to the project directory on the Hadoop cluster -

    Command: scp NYPD_Complaint_Data_Historic.csv NetID@dumbo.es.its.nyu.edu:/home/NetID/project/

    Also copy the scripts you want to run to the Hadoop cluster in the same way.

  4. Put the csv file into the Hadoop file system (HDFS) -

    Command: hfs -put NYPD_Complaint_Data_Historic.csv

  5. Run the cleaning script for an individual column (e.g., column 10); a sketch of what such a script might look like appears after this list -

    Command: spark-submit --py-files=helper.py col10.py NYPD_Complaint_Data_Historic.csv

  6. To obtain the cleaned csv file with all columns -

    Run the script: ./execute.sh

    Then run the command: spark-submit --py-files=helper.py merge.py NYPD_Complaint_Data_Historic.csv

  7. The cleaned csv can then be obtained with the command below -

    Command: hfs -getmerge data.csv cleaned.csv
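
For reference, a per-column cleaning script invoked as in step 5 tends to follow the pattern sketched below. This is a minimal sketch only: the null tokens, the output format, and the hard-coded column index are assumptions, not the actual contents of col10.py or helper.py.

```python
# col10_sketch.py -- illustrative only; not the actual col10.py
import sys

from pyspark import SparkContext


def clean_value(value):
    # Hypothetical rule: map empty strings and common null tokens to
    # 'NULL', otherwise keep the trimmed value.
    if value is None or value.strip() in ('', 'UNKNOWN', 'NA'):
        return 'NULL'
    return value.strip()


if __name__ == '__main__':
    sc = SparkContext(appName='clean_column_10')
    lines = sc.textFile(sys.argv[1])

    # Drop the csv header row.
    header = lines.first()
    rows = lines.filter(lambda line: line != header)

    # Take column 10 of each row and clean it. (The real scripts may
    # use a proper csv parser to handle quoted commas inside fields.)
    col10 = rows.map(lambda line: line.split(',')[10]).map(clean_value)

    # Label each value, similar to a per-column cleaning report.
    labelled = col10.map(
        lambda v: '%s\t%s' % (v, 'INVALID' if v == 'NULL' else 'VALID'))

    labelled.saveAsTextFile('col10.out')
    sc.stop()
```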

Part 2 - Data Analysis

The complete data analysis was performed on the cleaned csv file obtained after data cleaning.

Steps to generate cleaned csv file are mentioned in Part 1 above.

Prerequisites - Python, pandas, Jupyter Notebook, Matplotlib, NumPy

Steps to Reproduce Results

  1. Upload the cleaned csv file obtained from Part 1 to the Hadoop cluster-

    Command: hfs -put cleaned.csv

  2. Run the scripts that generate the data used for plotting; the scripts are in the Data Analysis/scripts folder. A sketch of such an aggregation script appears after this list -

    Command: spark-submit --py-files=helper.py crimes_by_year_month.py cleaned.csv

  3. Copy the data generated by the scripts from the Hadoop cluster to your local machine, into the Data Analysis/data folder -

    Command: scp -r NETID@dumbo.es.its.nyu.edu:/home/NETID/project/DataAnalysis/* .

  4. Start Jupyter Notebook from the DataAnalysis folder - Command: jupyter notebook

  5. Open any of the ipynb files (e.g., crimes_by_year.ipynb) to run it and generate the plots; a sketch of a typical plotting cell also appears after this list.
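
For reference, an aggregation script such as crimes_by_year_month.py can be sketched as below. The position and format of the complaint-date field are assumptions about the cleaned file, not the script's actual contents.

```python
# crimes_by_year_month_sketch.py -- illustrative only
import sys

from pyspark import SparkContext


def year_month(line):
    # Hypothetical layout: the complaint date, formatted MM/DD/YYYY,
    # is the second field of the cleaned csv.
    fields = line.split(',')
    month, _, year = fields[1].split('/')
    return (year, month)


if __name__ == '__main__':
    sc = SparkContext(appName='crimes_by_year_month')
    rows = sc.textFile(sys.argv[1])

    # Count complaints per (year, month) pair.
    counts = rows.map(year_month) \
                 .map(lambda key: (key, 1)) \
                 .reduceByKey(lambda a, b: a + b)

    # Emit 'year,month,count' lines for the plotting notebooks.
    counts.map(lambda kv: '%s,%s,%d' % (kv[0][0], kv[0][1], kv[1])) \
          .saveAsTextFile('crimes_by_year_month.out')
    sc.stop()
```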
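
The notebooks then load the generated data with pandas and plot it with Matplotlib. A minimal sketch of a typical plotting cell follows; the file name and column layout are assumptions, not the actual contents of crimes_by_year.ipynb.

```python
# Illustrative notebook cell; not the actual crimes_by_year.ipynb
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical input: one 'year,count' pair per line, no header,
# produced by a year-level aggregation script.
crimes = pd.read_csv('data/crimes_by_year.csv', names=['year', 'count'])

crimes = crimes.sort_values('year')
crimes.plot(x='year', y='count', kind='bar', legend=False)
plt.ylabel('Number of complaints')
plt.title('NYPD complaints per year')
plt.show()
```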

Part 3 - Data Exploration

  1. Weather - Crime rate is higher in summer than in winter.

    Monthly weather data for New York was collected from - http://www.holiday-weather.com/new_york_city/averages

    The monthly crime counts generated during analysis were used to support this hypothesis; a sketch of this check appears after the list.

  2. Poverty - Crime rate increases as poverty increases.

    Yearly poverty data for the New York boroughs was collected from - http://www1.nyc.gov/site/opportunity/poverty-in-nyc/data-tool.page

    The yearly crime counts per borough generated during analysis were used to support this hypothesis.
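
As an illustration of how these hypotheses can be checked once the external datasets are collected, the sketch below joins monthly complaint counts with monthly average temperatures and computes their correlation. The file names and layouts are assumptions, not the project's actual files; the same approach applies to the borough-level poverty data.

```python
# Illustrative hypothesis check; not part of the original scripts
import pandas as pd

# Hypothetical inputs: monthly complaint counts produced during the
# analysis step, and the hand-collected monthly weather averages.
crimes = pd.read_csv('data/crimes_by_month.csv', names=['month', 'count'])
weather = pd.read_csv('data/ny_monthly_weather.csv', names=['month', 'avg_temp'])

merged = crimes.merge(weather, on='month')

# A positive correlation between average temperature and complaint
# count supports the "more crime in summer" hypothesis.
print(merged['count'].corr(merged['avg_temp']))
```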