/soen471-project

SOEN 471 project

Firefighting cost prediction with tree-based models

Firefighting is one of the costliest domains of the public safety sector, both in damage after incidents and in preventive measures. Nonetheless, it is a fundamental public service, and keeping track of its cost can benefit both the firefighters working at the heart of the action and the government bodies that manage them.

Therefore, in this project, we answer the following research questions:

  1. Is it possible to predict the notional cost of firefighting operations using tree-based models?
  2. Which tree-based model provides the best prediction accuracy?
  3. Which features correlate most and least strongly with the notional cost?

To answer these questions, we use two public datasets from London (UK), both available on Kaggle: one of fire incident records and one of daily weather records.

We join both CSV files by date, ignoring fire incidents from 2022 and weather records from 1979 to 2008, since those records have no match in the other dataset. The final dataset contains 1,286,617 rows. We use this dataset to train three tree-based models that work well with tabular data:
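The join by date can be sketched with pandas as follows. This is a minimal illustration on toy rows, not the code from the preprocessing notebook: the weather column name `date` and all values are assumptions, and an inner join is used because it drops the years that appear in only one file.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are illustrative only).
incidents = pd.DataFrame({
    "DateOfCall": ["2009-01-01", "2009-01-01", "2009-01-02", "2022-01-05"],
    "NumPumpsAttending": [1, 2, 1, 3],
})
weather = pd.DataFrame({
    "date": ["1979-01-01", "2009-01-01", "2009-01-02"],  # hypothetical column name
    "mean_temp": [-4.1, 2.8, 3.5],
})

# Parse both keys as dates so they are compared by calendar day.
incidents["DateOfCall"] = pd.to_datetime(incidents["DateOfCall"])
weather["date"] = pd.to_datetime(weather["date"])

# Inner join: only dates present in both frames survive, so the 2022
# incident and the 1979 weather record are dropped.
joined = incidents.merge(
    weather, left_on="DateOfCall", right_on="date", how="inner"
).drop(columns="date")

print(joined)
```

Each incident row keeps its own columns and gains the weather reading for the same day, so several incidents on one date share one temperature value.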

  • decision tree
  • random forest
  • boosted tree

As for the algorithms, we use implementations provided by the Python libraries scikit-learn and XGBoost.
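As a rough sketch, the three model families could be instantiated as below. The classifier variants are assumed because the target is categorical; the exact classes and hyperparameters used in the notebooks may differ. If XGBoost is not installed, the sketch falls back to scikit-learn's `GradientBoostingClassifier`, which is a different boosted-tree implementation.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

try:
    # XGBoost is a separate install; use its scikit-learn style wrapper if present.
    from xgboost import XGBClassifier
    boosted = XGBClassifier(n_estimators=50)
except ImportError:
    # Stand-in only: a different boosted-tree implementation from scikit-learn.
    boosted = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Hyperparameter values here are placeholders, not the tuned ones.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "boosted tree": boosted,
}

for name, model in models.items():
    print(name, type(model).__name__)
```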

We use five features as the inputs of our models:

  1. month of the fire incident report call
  2. building type
  3. number of fire pumps attending the incident
  4. number of hours the fire pumps worked
  5. mean daily temperature (°C)

Our output feature is the fire pumps' notional cost in pounds sterling (£). The cost was originally a continuous numerical variable, but we converted it to a categorical variable by binning the values into intervals of £300. This process is explained in more detail in preprocessing/2_check_correlation.ipynb.
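The £300 binning can be illustrated with a small helper. This is a sketch under assumptions: the function name is hypothetical, and the choice of zero-based, left-closed intervals ([0, 300), [300, 600), ...) is a guess; the notebook may number or bound the categories differently.

```python
BIN_WIDTH = 300  # interval width in pounds, as stated in the text


def cost_to_category(cost_pounds: float) -> int:
    """Map a cost to the index of its £300-wide interval.

    Assumed convention: index 0 covers [0, 300), index 1 covers [300, 600), etc.
    """
    if cost_pounds < 0:
        raise ValueError("cost must be non-negative")
    return int(cost_pounds // BIN_WIDTH)


# Made-up costs, chosen to show the interval boundaries.
for cost in (0, 299.99, 300, 650):
    print(cost, "->", cost_to_category(cost))
```

Turning the cost into a category makes the task a classification problem, which is why accuracy can be used to compare the models.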

Finally, we evaluate the three models to determine which one gives the best prediction accuracy. In practice, the model we describe in this project could help estimate the operation cost of fire departments as soon as a call for a fire incident is made.

Detailed feature description

| Feature | Type | Description | Example |
|---|---|---|---|
| DateOfCall | Categorical | Date of the call to the fire department (we kept the month only) | 12 |
| PropertyType | Categorical | Description of the place where the fire happened | House - single occupancy |
| NumPumpsAttending | Integer | Number of pumps used in the incident. Number of firefighters = number of pumps multiplied by five | 3 |
| PumpHoursRoundUp | Integer | Time spent at the incident by pumps, rounded up to the nearest hour | 1 |
| mean_temp | Float | Mean temperature in degrees Celsius (°C) | 2.8 |
| Notional Cost (£) | Categorical | Time spent multiplied by the notional annual cost of a pump. Originally a numerical value in pounds; we converted it to a categorical variable | 1 |
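One way to turn the raw columns above into model inputs is sketched here. The rows and the second property label are made up, the output column names `month` and `property_code` are hypothetical, and integer category codes are only one possible encoding; the notebooks may encode `PropertyType` differently.

```python
import pandas as pd

# Toy rows using the column names from the table (values are illustrative).
raw = pd.DataFrame({
    "DateOfCall": ["2009-12-24", "2010-07-03", "2011-12-01"],
    "PropertyType": [
        "House - single occupancy",
        "Example other property",  # made-up label for illustration
        "House - single occupancy",
    ],
    "NumPumpsAttending": [3, 1, 2],
    "PumpHoursRoundUp": [1, 1, 2],
    "mean_temp": [2.8, 19.4, 6.1],
})

features = pd.DataFrame({
    # Keep only the month of the call date.
    "month": pd.to_datetime(raw["DateOfCall"]).dt.month,
    # Replace each property label with an integer code.
    "property_code": raw["PropertyType"].astype("category").cat.codes,
    "NumPumpsAttending": raw["NumPumpsAttending"],
    "PumpHoursRoundUp": raw["PumpHoursRoundUp"],
    "mean_temp": raw["mean_temp"],
})

print(features)
```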

Model evaluation

We split the rows of the dataset into roughly two thirds for training (66.6%) and one third for testing (33.3%). We also use hyperparameter tuning and cross-validation. More details are in chart.ipynb.
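The split, tuning and cross-validation steps might look like the sketch below for a single model. It runs on synthetic data from `make_classification`, and the parameter grid, the five folds and the use of `GridSearchCV` are assumptions rather than the project's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: five input features and a three-class target.
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Hold out one third of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0, stratify=y
)

# Try each parameter combination with 5-fold cross-validation
# on the training rows only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Score the best configuration once on the held-out test rows.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, round(test_accuracy, 3))
```

Keeping the test rows out of the search means the reported accuracy is not inflated by the tuning itself.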

How to install?

In order to execute the notebooks in this repository, you will need Python 3.8. The file requirements.txt in the root of the project contains the dependency list.

pip install -r requirements.txt 

Visualisation

To compare the models, we plot their results as a bar graph using plotly and matplotlib.
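A bar graph of this kind can be drawn with matplotlib as sketched below. The function name is hypothetical and the scores passed in are placeholders, not results from this project.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt


def plot_accuracy_bars(scores, path=None):
    """Draw one bar per model; save to `path` if one is given."""
    fig, ax = plt.subplots(figsize=(5, 3))
    bars = ax.bar(list(scores), list(scores.values()))
    ax.set_ylim(0, 1)
    ax.set_ylabel("Accuracy")
    ax.set_title("Model comparison")
    fig.tight_layout()
    if path is not None:
        fig.savefig(path)
    plt.close(fig)
    return bars


# Placeholder values only; replace with the measured test accuracies.
placeholder_scores = {"decision tree": 0.5, "random forest": 0.5, "boosted tree": 0.5}
bars = plot_accuracy_bars(placeholder_scores)
print(len(bars), "bars drawn")
```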

Where is the code?

For cleaning, reshaping and checking the data:

  • Data preprocessing: preprocessing/1_data_cleaning_with_weather.ipynb
  • Feature correlation: preprocessing/2_check_correlation.ipynb

For training and evaluating the tree-based models:

  • Decision trees: decision_tree.ipynb
  • Random forest: random_forest.ipynb
  • Boosted trees: boosted_tree.ipynb