Crime Hot Spots Visualization

Brief Introduction

This project uses clustering algorithms to uncover similarities between crime events.

The demo has two parts.

1. Crime Hot Spots Clustering K-Medoids (2017)

In this demo, we apply the K-Medoids algorithm to cluster the crime data and find similarities between different crime events.

Get Started (Without Rebuilding the Dataset)

Open Terminal/Cmd/Bash
Clone this repo
Switch to the crime-clustering-kmedoids directory
Run python -m http.server 1235 (make sure you are using Python 3.x)
Open your browser and visit http://localhost:1235/

$ git clone https://github.com/tsengkasing/Crime-Hot-Spots-Visualization.git
$ cd Crime-Hot-Spots-Visualization
$ cd crime-clustering-kmedoids
$ python -m http.server 1235

Step-by-step walkthrough (GIF)

Get Started (Rebuild Dataset)

Using preprocess.py:

preprocess.py preprocesses the dataset, mainly by splitting it into several parts.

Command:

python ./preprocess.py (splitByYear|splitByMonth|Sampling) dataFilePath SampleNum

Param Name	Description
splitByYear	Splits the data file by year.
splitByMonth	Splits the data file by month. Run it on a file that has already been split by year, because it only looks at the month part of the date string.
Sampling	Generates a sample of the data file.
dataFilePath	The path of the data file.
SampleNum	The number of samples, as an integer. Required when the command is Sampling.

Examples:

python ./preprocess.py Sampling ./Police_Department_Incident_Reports__Historical_2003_to_May_2018-2018.csv 500
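
The three commands can be approximated with the standard library alone. The sketch below is not the code in preprocess.py. It assumes each row is a dictionary with a Date field in MM/DD/YYYY form, and the function names split_by_year, split_by_month, and sample_rows are hypothetical.

```python
import random
from collections import defaultdict


def split_by_year(rows):
    """Group rows by the year at the end of an MM/DD/YYYY 'Date' string."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["Date"][-4:]].append(row)
    return dict(groups)


def split_by_month(rows):
    """Group rows by the two-digit month at the start of 'Date'.

    Only the month is inspected, so rows from different years share a group
    unless the input was split by year first.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row["Date"][:2]].append(row)
    return dict(groups)


def sample_rows(rows, sample_num, seed=None):
    """Return up to sample_num rows drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, min(sample_num, len(rows)))


# Made-up rows for illustration only.
rows = [
    {"Date": "01/15/2017", "Category": "ROBBERY"},
    {"Date": "01/20/2018", "Category": "ASSAULT"},
    {"Date": "03/02/2017", "Category": "ROBBERY"},
]
by_year = split_by_year(rows)
by_month = split_by_month(rows)
sample = sample_rows(rows, 2, seed=0)
```

Note that by_month["01"] mixes 2017 and 2018 rows here, which is why splitting by year first is recommended.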

Using k-medoids-clustering.py:

k-medoids-clustering.py runs the K-Medoids clustering. It clusters the input data file and writes the clustering results.

Command:

python ./k-medoids-clustering.py kNum dataFilePath

Param Name	Description
kNum	The k of the algorithm, i.e. the target number of clusters. Defaults to 5 if not provided.
dataFilePath	The path of the data file. If not provided, the default file Police_Department_Incident_Reports__Historical_2003_to_May_2018-2017-sample-500.csv is used.

Examples:

python ./k-medoids-clustering.py 5 Police_Department_Incident_Reports__Historical_2003_to_May_2018-2017-sample-500.csv
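
k-medoids-clustering.py itself is not reproduced here. As a rough illustration of what a k-medoids pass does, the sketch below alternates between assigning points to their nearest medoid and re-picking each medoid as the member with the smallest total distance to the rest of its cluster. The Manhattan distance, the random initialization, and the names k_medoids and assign_labels are assumptions for illustration. The script may use a different distance, initialization, or swap strategy (for example PAM).

```python
import random


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def assign_labels(points, medoids, dist):
    """Label each point with the position of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist(p, points[medoids[c]]))
        for p in points
    ]


def k_medoids(points, k, dist=manhattan, max_iter=100, seed=0):
    """Alternating k-medoids; returns (medoid indices, labels)."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(max_iter):
        labels = assign_labels(points, medoids, dist)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Keep the old medoid if duplicates left this cluster empty.
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimizes total distance to its cluster members.
            best = min(
                members,
                key=lambda i: sum(dist(points[i], points[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign_labels(points, medoids, dist)


# Two well-separated groups of made-up points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels = k_medoids(points, k=2)
```

Unlike K-Means centroids, the medoids are always actual data points, which is why the result file can report them by index.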

Outputs:

result-k-kNum-time-hour-min-sec.txt
result-k-kNum-time-hour-min-sec.png
result-k-kNum-time-hour-min-sec-data.json
result-k-kNum-time-hour-min-sec-plot.json
result-k-kNum-time-hour-min-sec.txt: Records basic clustering results, such as the number of iterations, the SSE of the final clusters, the silhouette coefficient, the medoid indices, and the cluster assignment of each data point.
result-k-kNum-time-hour-min-sec.png: A 2-dimensional plot of the clustering result, with the Time attribute on the x-axis and the Crime Category attribute on the y-axis.
result-k-kNum-time-hour-min-sec-data.json, result-k-kNum-time-hour-min-sec-plot.json: These two files are the inputs for the visualization.
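
For readers unfamiliar with the two quality measures in the .txt file, here is a minimal sketch of how SSE and the mean silhouette coefficient are commonly computed. It assumes Euclidean distance and measures SSE against each cluster's center. The script's exact definitions may differ, and the function names are hypothetical.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def sse(points, labels, centers, dist=euclidean):
    """Sum of squared distances from each point to its cluster center."""
    return sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


def silhouette(points, labels, dist=euclidean):
    """Mean silhouette coefficient; needs at least two clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # singleton clusters score 0 by convention
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Made-up points: two tight pairs far apart.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = [0, 0, 1, 1]
centers = [(0, 0), (10, 0)]
total_sse = sse(points, labels, centers)
score = silhouette(points, labels)
```

A lower SSE means tighter clusters, while a silhouette value close to 1 means points sit much nearer to their own cluster than to the next one.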

2. Crime Hot Spots With TimeLine (Robbery 2017)

In this demo, we apply the K-Means algorithm to cluster the crime data and find similarities between robbery events within a time period.

Get Started

Open Terminal/Cmd/Bash
Clone this repo
Switch to the crime-robbery-2017-timeline directory
Run python -m http.server 1234 (make sure you are using Python 3.x)
Open your browser and visit http://localhost:1234/

$ git clone https://github.com/tsengkasing/Crime-Hot-Spots-Visualization.git
$ cd Crime-Hot-Spots-Visualization
$ cd crime-robbery-2017-timeline
$ python -m http.server 1234

Step-by-step walkthrough (GIF)

Brief Conclusion

After several attempts, we clustered the data into 3 clusters.

Cluster 1 appears only in the evening and at night (17:00 ~ 23:00)
Cluster 2 appears only before dawn (0:00 ~ 7:00)
Cluster 3 appears only during the day (8:00 ~ 16:00)

All three clusters occur in the city center.

However, Cluster 1 and Cluster 2 also occur more often in the suburbs.

Implementation Method Introduction

Note: the processing code below is in cluster-crime-hot-spots.py.

First, we download the data from DataSF (San Francisco's open data portal) and import it.

The dataset is available at https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-Historical-2003/tmnf-yvry

import pandas as pd
data = pd.read_csv('/path/to/Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv')

Next, we run data['Category'].unique().

The output shows that there are 39 crime categories.

['NON-CRIMINAL', 'ROBBERY', 'ASSAULT', 'SECONDARY CODES',
       'VANDALISM', 'BURGLARY', 'LARCENY/THEFT', 'DRUG/NARCOTIC',
       'WARRANTS', 'VEHICLE THEFT', 'OTHER OFFENSES', 'WEAPON LAWS',
       'ARSON', 'MISSING PERSON', 'DRIVING UNDER THE INFLUENCE',
       'SUSPICIOUS OCC', 'RECOVERED VEHICLE', 'DRUNKENNESS', 'TRESPASS',
       'FRAUD', 'DISORDERLY CONDUCT', 'SEX OFFENSES, FORCIBLE',
       'FORGERY/COUNTERFEITING', 'KIDNAPPING', 'EMBEZZLEMENT',
       'STOLEN PROPERTY', 'LIQUOR LAWS', 'FAMILY OFFENSES', 'LOITERING',
       'BAD CHECKS', 'TREA', 'GAMBLING', 'RUNAWAY', 'BRIBERY',
       'PROSTITUTION', 'PORNOGRAPHY/OBSCENE MAT',
       'SEX OFFENSES, NON FORCIBLE', 'SUICIDE', 'EXTORTION']

Because the dataset is large, this demo focuses only on robbery events in 2017.

Before we apply the K-Means algorithm, we need to convert the nominal attributes to numeric ones.

We focus on four attributes: PdDistrict, DayOfWeek, Hour, and Month. Hour and Month are derived from the Time and Date columns.

Build a dictionary of PdDistrict

array_pddistrict = data['PdDistrict'].unique()
# Map each district name to its position in the list of unique values.
map_pddistrict = {name: i for i, name in enumerate(array_pddistrict)}

Build a dictionary of DayOfWeek

array_day = data['DayOfWeek'].unique()
# Map each day name to its position in the list of unique values.
map_day = {name: i for i, name in enumerate(array_day)}
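
To make the encoding concrete, here is a minimal sketch of the same idea without pandas. It assumes codes are handed out in order of first appearance, which is also the order unique() returns. The helper name build_code_map and the sample values are hypothetical.

```python
def build_code_map(values):
    """Map each distinct value to an integer in order of first appearance."""
    code_map = {}
    for value in values:
        if value not in code_map:
            code_map[value] = len(code_map)
    return code_map


# Made-up column values for illustration only.
days = ["Friday", "Monday", "Friday", "Sunday", "Monday"]
map_day = build_code_map(days)
day_numeric = [map_day[d] for d in days]
```

The codes are arbitrary identifiers: here Friday becomes 0 only because it appears first, so the numeric gap between two codes says nothing about how similar the two days or districts are.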

Add New Columns

# PdDistrict
def getPdDistrict(arr):
    return int(map_pddistrict[arr['PdDistrict']])
data['pddistrict_numeric'] = data.apply(getPdDistrict, axis = 1)

# DayOfWeek
def getDay(arr):
    return int(map_day[arr['DayOfWeek']])
data['day_numeric'] = data.apply(getDay, axis = 1)

# Hour
def getHour(arr):
    return int(arr['Time'][:2])
data['Hour'] = data.apply(getHour, axis = 1)

# Month
def getMonth(arr):
    return int(arr['Date'][:2])
data['month'] = data.apply(getMonth, axis = 1)
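
The slicing in getHour and getMonth relies on the Time and Date strings having a fixed layout. A more explicit alternative, sketched here under the assumption that the formats are HH:MM and MM/DD/YYYY, is to parse them with the standard library. The helper name parse_incident is hypothetical.

```python
from datetime import datetime


def parse_incident(date_str, time_str):
    """Parse 'MM/DD/YYYY' and 'HH:MM' strings into one datetime."""
    return datetime.strptime(f"{date_str} {time_str}", "%m/%d/%Y %H:%M")


# Made-up values for illustration only.
when = parse_incident("03/17/2017", "23:45")
hour, month, year = when.hour, when.month, when.year
```

Parsing raises a ValueError on a malformed value instead of silently producing a wrong hour or month.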

Select Data of 2017

data_2017 = data[data['Date'].str.contains('2017')]

Build an "X" Matrix with the Data of Robbery

import numpy as np

# Collect the four numeric features of every robbery event in 2017.
X = []
for i in range(len(data_2017)):
    row = data_2017.iloc[i]
    if row['Category'] == 'ROBBERY':
        X.append([row['pddistrict_numeric'], row['Hour'], row['day_numeric'], row['month']])

X = np.array(X)

Apply KMeans Algorithm

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
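
As background for the call above, the following sketch shows the basic K-Means loop (Lloyd's algorithm) in plain Python: assign each point to its nearest centroid, move each centroid to the mean of its members, and repeat until nothing changes. It is a simplified illustration, not scikit-learn's implementation. scikit-learn also uses k-means++ initialization by default, so its results need not match. The function names and sample points are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def nearest_labels(points, centroids):
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def lloyd_kmeans(points, centroids, max_iter=100):
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        labels = nearest_labels(points, centroids)
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous centroid.
            updated.append(mean_point(members) if members else old)
        if updated == centroids:
            break
        centroids = updated
    return centroids, nearest_labels(points, centroids)


# Made-up 2-D points for illustration only.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
centroids, labels = lloyd_kmeans(points, [(0, 0), (10, 10)])
```

Because the centroids are means of integer codes, they only make sense to the extent that the numeric encoding of each attribute does.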

Cluster labels
```
kmeans.labels_
```
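
One simple way to inspect the labels is to look at the range of hours covered by each cluster. The sketch below uses made-up values and a hypothetical helper name, hour_range_by_cluster. It does not reproduce the results of this demo.

```python
from collections import defaultdict


def hour_range_by_cluster(hours, labels):
    """Return {label: (earliest_hour, latest_hour)} for each cluster."""
    grouped = defaultdict(list)
    for hour, label in zip(hours, labels):
        grouped[int(label)].append(int(hour))
    return {label: (min(hs), max(hs)) for label, hs in sorted(grouped.items())}


# Made-up values for illustration only, not results from the dataset.
hours = [18, 22, 3, 6, 9, 15]
labels = [0, 0, 1, 1, 2, 2]
ranges = hour_range_by_cluster(hours, labels)
```

With the real data, hours would be X[:, 1] (the Hour column of the feature matrix) and labels would be kmeans.labels_.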

Build a new matrix that also contains the longitude ("X") and latitude ("Y")

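# Select the robbery rows again so that row i lines up with X[i];
# indexing data_2017 directly would pick up rows of other categories.
robbery_2017 = data_2017[data_2017['Category'] == 'ROBBERY']
output = []
for i in range(len(X)):
    row = robbery_2017.iloc[i]
    output.append(np.hstack((X[i], [kmeans.labels_[i]], [row['X'], row['Y']])))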

Save as JSON locally

generated_2017 = pd.DataFrame(output)
generated_2017.to_json('/path/to/crime_robbery_2017.json', orient='records')
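
Because the DataFrame above has no column names, to_json(orient='records') writes records whose keys are the column positions "0" through "6". The sketch below shows an alternative layout with descriptive keys, built with the standard library. The field names and sample row are hypothetical, and the visualization code in this repo may expect the positional keys instead.

```python
import json

# Hypothetical field names, in the same order as the columns of `output`:
# four features, the cluster label, then the X and Y coordinates.
FIELDS = ["pddistrict", "hour", "day", "month", "label", "x", "y"]


def to_records(rows, fields=FIELDS):
    """Turn equal-length numeric rows into a list of dicts keyed by field."""
    return [dict(zip(fields, (float(v) for v in row))) for row in rows]


# Made-up row for illustration only.
rows = [[3, 22, 5, 1, 0, -122.41, 37.78]]
records = to_records(rows)
text = json.dumps(records)
```

All values are stored as floats here, matching what np.hstack produces when integer features are combined with floating-point coordinates.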

Finally, we use the Google Maps API to visualize the data by cluster label and hour.

Please refer to the Google Maps Platform documentation:

https://developers.google.com/maps/documentation/javascript/examples/circle-simple