Data Used:NYPD Complaint Data Historic
The dataset used for this project can be downloaded from the NYC Open Data at https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i.
#For outputof column 15-23.
To get output of column 15-23 , please open output_script_liwei_15_23.
ADDR_PCT_CD-crime_addr.py
LOC_OF_OCCUR_DESC-crime_loc.py
PREM_TYP_DESC-crime_pt.py
PARKS_NM-crime_park.py
HADEVELOPT-crime_hd.py
X_COORD_CD-crime_xc.py
Y_COORD_CD-crime_yc.py
Latitude-crime_latitude.py
Longitude-crime_long.py
Lat_Lon-crime_ll.py
Enter
spark-submit *.py NYPD_Complaint_Data_Historic in dumbo
Then use hadoop fs -getmerge *.out *out
For example: to get output of Y_COORD_CD-crime_yc.py
We could use:
spark-submit crime_yc.py NYPD_Complaint_Data_Historic
After pyspark successfully finsh the task, enter the following command:
hadoop fs -getmerge crime_yc.out crime_yc.out
Then, we could use check.py to check the number of valid/missing/invalid numbers of column output.
For example for crime_yc.out:
spark-submit check.py crime_yc.out
hadoop fs -getmerge check.out check.out
To check semantic type of column output,we could use check_semantic.py:
For example:
spark-submit check_semantic.py crime_yc.out
hadoop fs -getmerge check.out check.out
Map Reduce Output for plots preparation In order to make explorary plot of some columns, we also write some scripts to process different column. Open mr_final ###For column 16-18: PREM_TYP_DESC-crime_16.py output:crime_16.out PARKS_NM-crime_17.py output:crime_17.out HADEVELOPT-crime_18.py output:crime_18.out
To run code:
spark-submit crime_##.py
hadoop fs -getmerge crime_##.out crime_##.out
To get data column labels(basetype, semantic type and validility) for column 1-15, you should run labelscript.py as follow:
spark-submit labelscript.py NYPD_Complaint_Data_Historic.csv
hadoop fs -getmerge [column-name]_output.out [column-name]_output.out
All needed data/output for plotting data summary could be find by running dataSummary1-9.py
In part II, the team did analysis bewween crime rate and subway station location. Essentially, the team is trying to find whether the closer to the subway station, the higher crime rate. The team calculate the Euclidean Distance between subway's location and each crime listing. And draw two circles with the same area(one small circle, one donut-shape circle) to calculate the density. The result is statistically significant.
To join the table of different subway station information and the table of NYC criminal data set. Cartesian join function by PySpark is primarily used. Scircle.py is the file used to perform the joining task and find crimes which occur in inner circle of all subway stations, and 2circle.py is the file used to find the outer circle of all subway stations.
To run the script and get the statistics for small circle:
spark-submit scircle.py NYPD_Complaint_Data_Historic.csv nysub.csv *.out
hadoop fs -getmerge *.out *out
To run the script and get the statistics for docut-shape circle:
spark-submit 2circle.py NYPD_Complaint_Data_Historic.csv nysub.csv *.out
hadoop fs -getmerge *.out *out
And subtract those two output we can compare two areas' density. The result is statistically significant.
To calculate the relation between temperature, unemployment and MOSAS recipeints versus. felony.
PySpark is first used to perform the map reduce job. Defaulted number of mapper and reducers are used because PySpark processes RDD data set. Crime_felony.py is the file, which is used to reduce counts of daily felony happened in Manhattan area.
Then, Jupyter notebook file part2.ipynb is used to aggregate weekly, monthly, quarterly and yearly felony counts as well. Then using the same jupyter notebook, we could draw plots of different relationships and calculate coefficients of different relationships. Similar procedure is used to get misdemeanor data with cime_misdemanor.py.
For example: to run the script
spark-submit crime_felony.py
hadoop fs -getmerge crime_felony.out crime_felony.out
And other cases are running the similar way.
All the used script could be found in the repo.
Data Used:
NYPD Complaint Data Historic
NYC Subway Dataset http://web.mta.info/nyct/service/
NNDC Climate Date https://www7.ncdc.noaa.gov/CDO/dataproduct
Unemployment Statistics data set https://data.ny.gov/Economic- Development/Local-Area-Unemployment-Statistics-LAUS-Seasonally/dh9m-5v4d/data
OASAS-Medicaid data Set https://data.ny.gov/Human-Services/OASAS-Medicaid- Trend-Recipient-Summary-Profile-Beg/g4vm-hyyi/data
NYC police preccint data set http://www1.nyc.gov/site/planning/data-maps/open-data/districts-download- metadata.page
NYC borough data set https://geo.nyu.edu/?f%5Bdc_format_s%5D%5B%5D=Shapefile&f%5Bdc_subject_sm %5D%5B%5D=Boroughs&f%5Bdct_spatial_sm%5D%5B%5D=New+York+City%2C+N ew+York%2C+United+States
NYC population open data set https://data.cityofnewyork.us/City-Government/NYC-Population-by-Borough/h2bk- zmw6/data