This is my Capstone Project for the Udacity Machine Learning Engineer Nanodegree.
The goal is to build a machine learning model inspired by the following question posed under the “Inspiration” header of Kaggle’s 1.88 Million US Wildfires dataset;
“Given the size, location and date, can you predict the cause of a wildfire?”
The input features for my model include the wildfire’s date, time of discovery, (geopolitical) state, and estimated size (in acres). Its output will be a cause amongst all possible wildfire causes in the dataset.
More details can be found in the report.pdf
file.
The results of this project can be reproduced by loading the repo into an Amazon SageMaker Notebook Instance. The notebooks in the src/
directory can be executed in the order specified by their filename prefixes.
You may need to specify the following Notebook kernels:
conda_amazonei_mxnet_p36
for the the3_Train_Linear_Model.ipynb
notebook.conda_mxnet_python36
for the7_Interpretability.ipynb
notebook.conda_python3
for all other notebooks.
proposal.pdf
- Project proposal.report.pdf
- Overview of the project and results.
The data used in this project was obtained from Kaggle’s 1.88 Million US Wildfires dataset and is stored in the src/wildfire_data
directory. This data can be loaded using the function load_raw_wildfire_data
from src/utils.py
.
Notebooks in this repository are numbered and are meant to be executed in the order listed below. The 2_Feature_Engineering.ipynb
notebook generates new pkl
and csv
files into the src/wildfire_data
directory. These are essential to the other notebooks.
1_Data_Exploration.ipynb
- For exploratory purposes.2_Feature_Engineering.ipynb
- Preprocess and store train, validation and test data.3_Train_Linear_Model.ipynb
- Train a Linear Learner model.4_Train_MLP_Model.ipynb
- Train an MLP model.5_Train_XGB_Model.ipynb
- Train a XGBoost model.6_Train_Refined_MLP_Model.ipynb
- Train a refined MLP model with reduced output classes.7_Interpretability.ipynb
- Explore feature importance under MLP model.
utils.py
- Basic helper functions including data loading.get_model.py
- Helper function to download SageMaker trained models.mlp_source/train_mlp.py
- Entry point for MLP model training.
models/
- All models trained by notebooks 3, 4, 5, and 6 above, as archived by SageMaker.
When the 2_Feature_Engineering.ipynb
notebook is run, the following files are generated inside the src/wildfire_data
directory:
train.csv
,validation.csv
,test.csv
- CSV data for training, validation, and testing.train_ref.csv
,validation_ref.csv
,test_ref.csv
- CSV data for refined models with reduced output classes.causes_names.pkl
- List of all original wildfire cause names.causes_names_refinement.pkl
- List of reduced wildfire cause names.feature_names.pkl
- List of names of the features used for training and testing the models.
Kaggle’s 1.88 Million US Wildfires dataset
The above dataset was originally obtained from:
Short, Karen C. 2017. Spatial wildfire occurrence data for the United States, 1992-2015 [FPAFOD20170508]. 4th Edition. Fort Collins, CO: Forest Service Research Data Archive. https://doi.org/10.2737/RDS-2013-0009.4