Crime Hot Spots Visualization
Brief Introduction
We use clustering algorithms to uncover similarities between crime events.
There are two parts in this demo.
1. Crime Hot Spots Clustering K-Medoids (2017)
In this demo, we apply the K-Medoids algorithm to cluster the crime data and uncover similarities between different crime events.
Get Started (Without Rebuilding the Dataset)
- Open Terminal/Cmd/Bash
- Clone this repo
- Switch to the directory crime-clustering-kmedoids
- Type python -m http.server 1235 (MAKE SURE you are using Python 3.x)
- Open your browser and visit http://localhost:1235/
$ git clone https://github.com/tsengkasing/Crime-Hot-Spots-Visualization.git
$ cd Crime-Hot-Spots-Visualization
$ cd crime-clustering-kmedoids
$ python -m http.server 1235
Step-by-step GIF ↓
Get Started (Rebuilding the Dataset)
Using preprocess.py:
preprocess.py preprocesses the dataset, mainly by splitting it into several parts or sampling it (a rough sketch of the Sampling mode follows the example below).
Command:
python ./preprocess.py (splitByYear|splitByMonth|Sampling) dataFilePath SampleNum
Param Name | Description |
---|---|
splitByYear | The data file will be split into different years. |
splitByMonth | The data file will be split into different months. It is suggested to run this on a file that has already been split into years, because it only parses the month string. |
Sampling | Generate samples of the data file. |
dataFilePath | The path of the data file. |
SampleNum | Must be an integer; required only when the command is Sampling. |
Examples:
python ./preprocess.py Sampling ./Police_Department_Incident_Reports__Historical_2003_to_May_2018-2018.csv 500
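For reference, here is a rough, self-contained sketch of what the Sampling mode could look like with pandas. It is an illustration only: the function name sample_csv, the fixed random_state, and the output file naming are assumptions of this sketch, not the actual behaviour of preprocess.py.

import sys
import pandas as pd

def sample_csv(data_file_path, sample_num):
    # Draw a random sample of rows from the incident CSV and save it next to the original.
    data = pd.read_csv(data_file_path)
    # The seed is fixed here only to make the sketch reproducible
    sample = data.sample(n=sample_num, random_state=0)
    output_path = data_file_path.replace('.csv', '-sample-%d.csv' % sample_num)
    sample.to_csv(output_path, index=False)
    print('Saved %d rows to %s' % (len(sample), output_path))

if __name__ == '__main__':
    # e.g. python ./sampling_sketch.py ./some_incidents.csv 500
    sample_csv(sys.argv[1], int(sys.argv[2]))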
Using k-medoids-clustering.py:
k-medoids-clustering.py performs the K-medoids clustering: it clusters the given input data file and outputs the clustering results (a minimal sketch of the core algorithm is shown after the output list below).
Command:
python ./k-medoids-clustering.py kNum dataFilePath
Param Name | Description |
---|---|
kNum | The K of the K-medoids algorithm, i.e. the number of target clusters. If this argument is not provided, the algorithm runs with K = 5. |
dataFilePath | The path of the data file. If it is not provided, the algorithm clusters the default file Police_Department_Incident_Reports__Historical_2003_to_May_2018-2017-sample-500.csv. |
Examples:
python ./k-medoids-clustering.py 5 Police_Department_Incident_Reports__Historical_2003_to_May_2018-2017-sample-500.csv
Outputs:
- result-k-kNum-time-hour-min-sec.txt
- result-k-kNum-time-hour-min-sec.png
- result-k-kNum-time-hour-min-sec-data.json
- result-k-kNum-time-hour-min-sec-plot.json

- result-k-kNum-time-hour-min-sec.txt: Records basic results of the clustering, such as the number of iteration rounds, the SSE of the final clusters, the Silhouette Coefficient, the medoids' indices, and the cluster assignment of each data point.
- result-k-kNum-time-hour-min-sec.png: A 2-dimensional diagram of the clustering result, with the Crime Category attribute on the y-axis and the Time attribute on the x-axis.
- result-k-kNum-time-hour-min-sec-data.json, result-k-kNum-time-hour-min-sec-plot.json: These 2 files are the inputs for the visualization.
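As mentioned above, k-medoids-clustering.py in this repository is the actual implementation; the snippet below is only a minimal sketch of a PAM-style K-medoids loop on a small numeric matrix, to illustrate the idea. The helper name k_medoids and the use of scipy's cdist for pairwise distances are assumptions of this sketch, and the real script additionally reports SSE and the Silhouette Coefficient, which are not reproduced here.

import numpy as np
from scipy.spatial.distance import cdist

def k_medoids(X, k, max_iter=100, seed=0):
    # Minimal PAM-style K-medoids: alternate point assignment and medoid update.
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        # Assign each point to its nearest medoid
        labels = cdist(X, X[medoid_idx]).argmin(axis=1)
        # For each cluster, pick the member that minimizes total distance to the other members
        new_medoid_idx = medoid_idx.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            within = cdist(X[members], X[members]).sum(axis=1)
            new_medoid_idx[c] = members[within.argmin()]
        if np.array_equal(new_medoid_idx, medoid_idx):
            break
        medoid_idx = new_medoid_idx
    return medoid_idx, labels

# Tiny usage example on random 2-D points
points = np.random.default_rng(1).random((50, 2))
medoids, labels = k_medoids(points, k=5)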
2. Crime Hot Spots With TimeLine (Robbery 2017)
In this demo, we apply the KMeans algorithm to cluster the crime data and uncover similarities between robbery events within a time period.
Get Started
- Open Terminal/Cmd/Bash
- Clone this repo
- Switch to the directory crime-robbery-2017-timeline
- Type python -m http.server 1234 (MAKE SURE you are using Python 3.x)
- Open your browser and visit http://localhost:1234/
$ git clone https://github.com/tsengkasing/Crime-Hot-Spots-Visualization.git
$ cd Crime-Hot-Spots-Visualization
$ cd crime-robbery-2017-timeline
$ python -m http.server 1234
Step-by-step GIF ↓
Tiny Conclusion
After several attempts, we cluster the data into 3 clusters (one way to compare candidate values of K is sketched after this list).
- Cluster 1 only appears at night (17:00 ~ 23:00)
- Cluster 2 only appears before dawn (0:00 ~ 7:00)
- Cluster 3 only appears during the day (8:00 ~ 16:00)
All three clusters occur in the city center.
However, Cluster 1 and Cluster 2 occur more often in the suburbs.
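The README does not record how the "several attempts" were compared; one common approach, sketched below, is to fit KMeans for a few candidate values of K and compare their Silhouette Coefficients with scikit-learn. This is an illustration rather than the project's actual selection procedure; the helper name compare_k is an assumption, and in practice X would be the robbery feature matrix built in the Implementation section below.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def compare_k(X, candidates=range(2, 7)):
    # Fit KMeans for several K and report the Silhouette Coefficient of each clustering.
    for k in candidates:
        kmeans = KMeans(n_clusters=k, random_state=0).fit(X)
        score = silhouette_score(X, kmeans.labels_)
        print('k = %d, silhouette = %.3f' % (k, score))

# Example with random data; replace with the real feature matrix X.
compare_k(np.random.default_rng(0).random((200, 4)))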
Implementation Method Introduction
[P.S.] The following processing code is placed in cluster-crime-hot-spots.py.
In the beginning, we download and import the data from DataSF (San Francisco open data).
Here is the link: https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-Historical-2003/tmnf-yvry
# Imports used by the snippets below
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
data = pd.read_csv('/path/to/Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv')
Next, we run the following code:
data['Category'].unique()
We can see that there are 39 kinds of crime categories:
['NON-CRIMINAL', 'ROBBERY', 'ASSAULT', 'SECONDARY CODES',
'VANDALISM', 'BURGLARY', 'LARCENY/THEFT', 'DRUG/NARCOTIC',
'WARRANTS', 'VEHICLE THEFT', 'OTHER OFFENSES', 'WEAPON LAWS',
'ARSON', 'MISSING PERSON', 'DRIVING UNDER THE INFLUENCE',
'SUSPICIOUS OCC', 'RECOVERED VEHICLE', 'DRUNKENNESS', 'TRESPASS',
'FRAUD', 'DISORDERLY CONDUCT', 'SEX OFFENSES, FORCIBLE',
'FORGERY/COUNTERFEITING', 'KIDNAPPING', 'EMBEZZLEMENT',
'STOLEN PROPERTY', 'LIQUOR LAWS', 'FAMILY OFFENSES', 'LOITERING',
'BAD CHECKS', 'TREA', 'GAMBLING', 'RUNAWAY', 'BRIBERY',
'PROSTITUTION', 'PORNOGRAPHY/OBSCENE MAT',
'SEX OFFENSES, NON FORCIBLE', 'SUICIDE', 'EXTORTION']
Due to the large dataset, we only focus on the robbery events of 2017 in this demo.
Before we apply the KMeans algorithm, we need to transform the nominal attributes into numeric attributes.
We mainly focus on 4 attributes: PdDistrict, DayOfWeek, Hour, and Month.
- Build a dictionary of PdDistrict
array_pddistrict = data['PdDistrict'].unique()
map_pddistrict = {}
for i in range(len(array_pddistrict)):
    map_pddistrict[array_pddistrict[i]] = i
- Build a dictionary of DayOfWeek
array_day = data['DayOfWeek'].unique()
map_day = {}
for i in range(len(array_day)):
    map_day[array_day[i]] = i
- Add New Columns
# PdDistrict
def getPdDistrict(arr):
    return int(map_pddistrict[arr['PdDistrict']])
data['pddistrict_numeric'] = data.apply(getPdDistrict, axis=1)
# DayOfWeek
def getDay(arr):
    return int(map_day[arr['DayOfWeek']])
data['day_numeric'] = data.apply(getDay, axis=1)
# Hour
def getHour(arr):
    return int(arr['Time'][:2])
data['Hour'] = data.apply(getHour, axis=1)
# Month
def getMonth(arr):
    return int(arr['Date'][:2])
data['month'] = data.apply(getMonth, axis=1)
- Select Data of 2017
data_2017 = data[data['Date'].str.contains('2017')]
- Build an "X" Matrix with the Data of Robbery
X = []
for i in range(len(data_2017)):
    row = data_2017.iloc[i]
    if row['Category'] == 'ROBBERY':
        X.append([row['pddistrict_numeric'], row['Hour'], row['day_numeric'], row['month']])
X = np.array(X)
- Apply KMeans Algorithm
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
- Cluster labels
kmeans.labels_
- Build a new Matrix containing longitude ("X") and latitude ("Y")
# Use only the robbery rows so the coordinates line up with the rows used to build X
robbery_2017 = data_2017[data_2017['Category'] == 'ROBBERY']
output = []
for i in range(len(X)):
    row = robbery_2017.iloc[i]
    output.append(np.hstack((X[i], [kmeans.labels_[i]], [row['X'], row['Y']])))
- Save as JSON locally
generated_2017 = pd.DataFrame(output)
generated_2017.to_json('/path/to/crime_robbery_2017.json', orient='records')
- Use the Google Maps API to visualize the data according to Label and Hour. Please refer to the Google Maps Platform documentation:
https://developers.google.com/maps/documentation/javascript/examples/circle-simple
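As a quick sanity check of the Tiny Conclusion above, the hour range of each cluster can be summarized directly from the output list built a few steps earlier. This is only an illustrative sketch; the column names below are assumptions chosen to label the positions produced by the np.hstack call above (district, hour, day, month, cluster label, longitude "X", latitude "Y").

import numpy as np
import pandas as pd

# Column order follows the np.hstack call above
summary = pd.DataFrame(np.array(output),
                       columns=['district', 'hour', 'day', 'month', 'label', 'lon', 'lat'])
# Minimum hour, maximum hour, and size of each cluster
print(summary.groupby('label')['hour'].agg(['min', 'max', 'count']))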