Udacity Azure ML Nanodegree Capstone Project - Predicting the Survival of Titanic Passengers

This project uses the Kaggle Titanic dataset in an Azure ML workspace to train models with different tools and deploys the best machine learning model as a web service using the Python SDK.

Project Pipeline

Pipeline

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg, resulting in the deaths of 1502 of the 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this project, I build a classification model to predict whether a passenger survived or not.

Project Set Up and Installation

To run this project, you will need an active Kaggle account. On Kaggle, search for Titanic, enter the competition, and download the dataset. Then import the data into your Azure ML Studio.

Dataset

Overview

We use the dataset from the Kaggle Titanic competition. The data dictionary is as follows:

  • survival - Survival (0 = No, 1 = Yes)
  • pclass - Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • sex - Gender
  • age - Age in years
  • sibsp - # of siblings / spouses aboard the Titanic
  • parch - # of parents / children aboard the Titanic
  • ticket - Ticket number
  • fare - Passenger fare
  • cabin - Cabin number
  • embarked - Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Task

The task for this project is to train models that classify whether a passenger survived the Titanic shipwreck. The 'Survived' column in the dataset is 1 when the passenger survived the shipwreck and 0 when he/she did not survive.

Access

Download the data from Kaggle; once the data is downloaded, register it to the Azure ML workspace. Some data cleaning is done before the model is trained.
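The cleaning step can be sketched as below. This is a minimal, hypothetical example with pandas, assuming the capitalized column names of the Kaggle CSV (Name, Age, Sex, Embarked, etc.); the notebook's actual cleaning may differ.

```python
import pandas as pd

def clean_titanic(df: pd.DataFrame) -> pd.DataFrame:
    """Minimal cleaning sketch: drop high-cardinality columns, fill
    missing values, and encode categoricals as integer codes."""
    df = df.copy()
    # Name, Ticket and Cabin are dropped here for simplicity.
    df = df.drop(columns=["Name", "Ticket", "Cabin"], errors="ignore")
    # Fill missing ages with the median, missing ports with the mode.
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    # Encode Sex and Embarked as numbers.
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    return df
```

The cleaned frame can then be registered as a tabular dataset in the workspace.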

Automated ML

The automl notebook walks you through the steps of configuring and running the AutoML experiment. We run a classification task on the 'Survived' column of the Titanic dataset, with the primary metric set to 'accuracy', automatic featurization, and an experiment timeout of 30 minutes.
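The configuration described above might look roughly like this with the Azure ML Python SDK (v1). The dataset name and compute target name here are placeholders, not the notebook's actual values:

```python
from azureml.core import Workspace, Dataset, Experiment
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()
# Hypothetical name for the registered, cleaned Titanic dataset.
train_ds = Dataset.get_by_name(ws, name="titanic-cleaned")

automl_config = AutoMLConfig(
    task="classification",            # predict the 'Survived' column
    training_data=train_ds,
    label_column_name="Survived",
    primary_metric="accuracy",
    featurization="auto",
    experiment_timeout_minutes=30,
    compute_target="cpu-cluster",     # hypothetical compute name
)

experiment = Experiment(ws, "titanic-automl")
run = experiment.submit(automl_config, show_output=True)
```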

Results

The snapshot below shows the different models generated by the AutoML run.

AutoML RunDetails Notebook

AutoML RunDetails

The Voting Ensemble model was the best, with an accuracy of 82.9%.

Best Model

best model complete

Hyperparameter Tuning

The hyperparameter tuning notebook walks you through the steps of the HyperDrive run. I have chosen the Random Forest model; Random Forest models generally provide high accuracy because they are ensemble (bagging) models.

For the hyperparameter tuning of this model, we tune four different parameters of the forest using random parameter sampling:

  • n_estimators: The number of trees in the random forest
  • max_depth: The maximum depth of the trees in the forest
  • min_samples_split: The minimum number of samples required to split an internal node
  • min_samples_leaf: The minimum number of samples required to be at a leaf node
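The search over these four parameters can be reproduced locally with scikit-learn's RandomizedSearchCV, which, like HyperDrive's random sampling, draws configurations at random from discrete choices. The synthetic data and the value ranges below are illustrative assumptions, not the project's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the cleaned Titanic features (7 numeric columns).
X, y = make_classification(n_samples=300, n_features=7, random_state=42)

# The same four hyperparameters the HyperDrive run samples, as discrete choices.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 4, 8, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,           # number of random configurations to try
    cv=3,                # 3-fold cross-validation
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

In the actual run, HyperDrive plays the role of RandomizedSearchCV, submitting each sampled configuration as a separate child run on the compute cluster.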

Results

The HyperDrive run achieved an accuracy of 86.8%.

hyperdrive

hyperdrive run

Model Deployment

I deployed the Voting Ensemble model generated by AutoML. I registered the model and deployed it as a web service using ACI (Azure Container Instance). Sample data is fed to the deployed web service as a request, as shown below.
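A request to the web service can be built as sketched below. The scoring URI, the payload shape, and the encoded feature values are assumptions for illustration; the real URI is printed by the ACI deployment, and the expected input schema comes from the model's scoring script.

```python
import json
import urllib.request

# Hypothetical scoring URI printed by the ACI deployment.
scoring_uri = "http://<aci-endpoint>.azurecontainer.io/score"

# One passenger record, using the columns from the data dictionary above
# (Sex and Embarked already encoded as integers).
payload = json.dumps({
    "data": [{
        "Pclass": 3, "Sex": 0, "Age": 22.0,
        "SibSp": 1, "Parch": 0, "Fare": 7.25, "Embarked": 0,
    }]
})

request = urllib.request.Request(
    scoring_uri,
    data=payload.encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a real endpoint, send the request and read the prediction:
# response = urllib.request.urlopen(request)
# print(json.loads(response.read()))
```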

Web Service 1

Web Service 2

Web Service 3

Screen Recording

Below is the link to the video recording:

Azure ML Capstone

Standout Suggestions

In this deployment I have enabled Application Insights, which helps with logging and monitoring of the web service.
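With SDK v1, Application Insights can also be switched on after deployment via `Webservice.update`. The service name below is a placeholder, not the project's actual deployment name:

```python
from azureml.core import Workspace
from azureml.core.webservice import Webservice

ws = Workspace.from_config()
# Hypothetical name of the deployed ACI service.
service = Webservice(ws, name="titanic-automl-service")

# Turn on Application Insights for request/response logging and monitoring.
service.update(enable_app_insights=True)
print(service.state)
```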

Future work

  • Do more work on data cleaning (e.g. the name column) and feature engineering to derive more valuable columns.
  • Work on converting the registered model to ONNX format.
  • Audit the models for overfitting and investigate measures to deal with the imbalanced dataset.