- Clone the repository
- Download the corresponding datasets from Kaggle
- Copy the downloaded data to the /dataset folder of this repository
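If you prefer scripting the download, here is a minimal sketch using the official `kaggle` Python package (`pip install kaggle`, with API credentials in `~/.kaggle/kaggle.json`). The dataset slug `currie32/crimes-in-chicago` is an assumption inferred from the file names used below, so verify it on Kaggle first:

```python
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json
# The dataset slug is an assumption -- verify it on Kaggle before running.
api.dataset_download_files('currie32/crimes-in-chicago',
                           path='./dataset', unzip=True)
```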
Important: We have implemented a parallelization strategy, which is deactivated by default. If you are using a Unix/macOS machine, search for the Globals section (top of the code) and set RUN_PARALLEL = True to speed up execution.
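The flag name `RUN_PARALLEL` comes from the Globals section; the snippet below is only a minimal sketch of how such a toggle could be wired up, assuming a `multiprocessing.Pool`-based strategy. The helper `run_tasks` is hypothetical and may differ from the repository's actual implementation:

```python
import multiprocessing as mp

# Globals section (top of code)
RUN_PARALLEL = False  # set to True on Unix/macOS machines

def run_tasks(func, items):
    """Hypothetical helper: apply func to items, in parallel if enabled."""
    if RUN_PARALLEL:
        with mp.Pool() as pool:  # forks worker processes (Unix/macOS)
            return pool.map(func, items)
    return [func(item) for item in items]
```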
- All sections are surrounded by triple quotes. Toggling a `#` in front of the opening `"""` comments or uncomments the whole section (without the `#`, the section becomes an inert string literal):

```python
# This is a commented section
"""
print("Started section: Data loading")
dataset_04 = pd.read_csv('./dataset/Chicago_Crimes_2001_to_2004.csv',
                         sep=',', header=0, error_bad_lines=False,
                         low_memory=False, na_values=[''])
#"""

# This is an uncommented section
#"""
print("Started section: Data loading")
dataset_04 = pd.read_csv('./dataset/Chicago_Crimes_2001_to_2004.csv',
                         sep=',', header=0, error_bad_lines=False,
                         low_memory=False, na_values=[''])
#"""
```
- The Data understanding section is commented out by default, as the correlation analyses take quite a long time.
- Within the Modeling section we have commented out all models except Decision Tree. You can enable or disable them as you like.
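To illustrate, here is a hedged sketch of how that toggle pattern might look with scikit-learn classifiers. Only the fact that Decision Tree is active by default comes from this README; the model dictionary and hyperparameters are illustrative, not the repository's exact code:

```python
from sklearn.tree import DecisionTreeClassifier
# from sklearn.linear_model import LogisticRegression
# from sklearn.naive_bayes import BernoulliNB

# Only Decision Tree is active by default; comment models in or out.
models = {
    'Decision Tree': DecisionTreeClassifier(),
    # 'Logistic Regression': LogisticRegression(max_iter=1000),
    # 'Bernoulli Naive Bayes': BernoulliNB(),
}

for name, model in models.items():
    print(f"Started model: {name}")
    # model.fit(X_train, y_train)  # training data prepared earlier
```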
- The code is not very talkative and does not write many results to stdout. We have used a slightly customized MATLAB layout in Spyder.
- We have prepared some great visualizations of the dataset. You can play with them here.
- Within the /results folder you will find two .csv files containing ready-made result sets for the standard Logistic Regression and Bernoulli Naive Bayes methods.
- The dataset is enormous, so bring plenty of time and patience.
- Multinomial Logistic Regression needs several hours (we did not manage to run a full-blown test), so you should consider a nightly run. The correlation analyses are also quite slow, which is why we have added a global constant `CORRELATION_SAMPLE_SIZE = 0.25` to configure a sample size for these calculations; a sketch of its use follows below.
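A minimal sketch of how such a sampling constant is typically applied before a pandas correlation analysis; the function name and `random_state` are hypothetical, not the repository's exact code:

```python
import pandas as pd

CORRELATION_SAMPLE_SIZE = 0.25  # fraction of rows used for correlations

def sampled_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical usage: correlate on a random sample to cut runtime."""
    sample = df.sample(frac=CORRELATION_SAMPLE_SIZE, random_state=42)
    return sample.corr()
```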
- We have used a MacBook Pro equipped with an i7 processor and 16 GB RAM. You should not use your grandma's PC for this data.
Feel free to open an issue; we will try to get back to you as fast as possible!