Udacity Azure ML Nanodegree Capstone Project - Predicting the Survival of Titanic Passengers

This project uses the Kaggle Titanic dataset in an Azure ML workspace to train models with different tools and deploys the best machine learning model as a web service using the Python SDK.

Project Pipeline

Pipeline

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg, resulting in the deaths of 1502 of the 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this project, I build a classification model to predict whether a passenger survived or not.

Project Set Up and Installation

To run this project, you will need an active Kaggle account. On Kaggle, search for Titanic, enter the competition, and download the dataset. Then import the data into your Azure ML Studio.

Dataset

Overview

We use the dataset from the Kaggle Titanic competition. The data dictionary is as follows:

  • survival - Survival (0 = No, 1 = Yes)
  • pclass - Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • sex - Gender
  • age - Age in years
  • sibsp - # of siblings / spouses aboard the Titanic
  • parch - # of parents / children aboard the Titanic
  • ticket - Ticket number
  • fare - Passenger fare
  • cabin - Cabin number
  • embarked - Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Task

The task for this project is to train models that classify whether a passenger survived the Titanic shipwreck. The 'Survived' column in the dataset is 1 when the passenger survived the shipwreck and 0 when he/she did not survive.

Access

Download the data from Kaggle; once the data is downloaded, register it to the Azure ML workspace. Some data cleaning is done before the model is trained.
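The cleaning step can be sketched as below. This is a minimal, hypothetical example with pandas, assuming the capitalized column names of the Kaggle CSV (Name, Age, Sex, Embarked, etc.); the notebook's actual cleaning may differ.

```python
import pandas as pd

def clean_titanic(df: pd.DataFrame) -> pd.DataFrame:
    """Minimal cleaning sketch: drop high-cardinality columns, fill
    missing values, and encode categoricals as integer codes."""
    df = df.copy()
    # Name, Ticket and Cabin are dropped here for simplicity.
    df = df.drop(columns=["Name", "Ticket", "Cabin"], errors="ignore")
    # Fill missing ages with the median, missing ports with the mode.
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    # Encode Sex and Embarked as numbers.
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    return df
```

The cleaned frame can then be registered as a tabular dataset in the workspace.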

Automated ML

The automl notebook walks you through the steps of configuring and running the AutoML experiment. We run a classification task on the 'Survived' column of the Titanic dataset, with the primary metric set to 'accuracy', automatic featurization, and an experiment timeout of 30 minutes.
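The configuration described above might look roughly like this with the Azure ML Python SDK (v1). The dataset name and compute target name here are placeholders, not the notebook's actual values:

```python
from azureml.core import Workspace, Dataset, Experiment
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()
# Hypothetical name for the registered, cleaned Titanic dataset.
train_ds = Dataset.get_by_name(ws, name="titanic-cleaned")

automl_config = AutoMLConfig(
    task="classification",            # predict the 'Survived' column
    training_data=train_ds,
    label_column_name="Survived",
    primary_metric="accuracy",
    featurization="auto",
    experiment_timeout_minutes=30,
    compute_target="cpu-cluster",     # hypothetical compute name
)

experiment = Experiment(ws, "titanic-automl")
run = experiment.submit(automl_config, show_output=True)
```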

Results

The snapshot below shows the different models generated by the AutoML run.

AutoML RunDetails Notebook

AutoML RunDetails

The Voting Ensemble model was the best, with an accuracy of 82.9%.

Best Model

best model complete

Hyperparameter Tuning

The hyperparameter tuning notebook walks you through the steps of the HyperDrive run. I have chosen the Random Forest model; Random Forest models generally provide high accuracy because they are ensemble (bagging) models.

For the hyperparameter tuning of this model, we tune four different parameters of the forest using random parameter sampling:

  • n_estimators: The number of trees in the random forest
  • max_depth: The maximum depth of the trees in the forest
  • min_samples_split: The minimum number of samples required to split an internal node
  • min_samples_leaf: The minimum number of samples required to be at a leaf node
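The search over these four parameters can be reproduced locally with scikit-learn's RandomizedSearchCV, which, like HyperDrive's random sampling, draws configurations at random from discrete choices. The synthetic data and the value ranges below are illustrative assumptions, not the project's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the cleaned Titanic features (7 numeric columns).
X, y = make_classification(n_samples=300, n_features=7, random_state=42)

# The same four hyperparameters the HyperDrive run samples, as discrete choices.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 4, 8, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,           # number of random configurations to try
    cv=3,                # 3-fold cross-validation
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

In the actual run, HyperDrive plays the role of RandomizedSearchCV, submitting each sampled configuration as a separate child run on the compute cluster.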

Results

The HyperDrive run achieved an accuracy of 86.8%.

hyperdrive

hyperdrive run

Model Deployment

I deployed the Voting Ensemble model generated by AutoML. I registered the model and deployed it as a web service using ACI (Azure Container Instance). Sample data is fed to the deployed web service as a request, as shown below.
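A request to the web service can be built as sketched below. The scoring URI, the payload shape, and the encoded feature values are assumptions for illustration; the real URI is printed by the ACI deployment, and the expected input schema comes from the model's scoring script.

```python
import json
import urllib.request

# Hypothetical scoring URI printed by the ACI deployment.
scoring_uri = "http://<aci-endpoint>.azurecontainer.io/score"

# One passenger record, using the columns from the data dictionary above
# (Sex and Embarked already encoded as integers).
payload = json.dumps({
    "data": [{
        "Pclass": 3, "Sex": 0, "Age": 22.0,
        "SibSp": 1, "Parch": 0, "Fare": 7.25, "Embarked": 0,
    }]
})

request = urllib.request.Request(
    scoring_uri,
    data=payload.encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a real endpoint, send the request and read the prediction:
# response = urllib.request.urlopen(request)
# print(json.loads(response.read()))
```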

Web Service 1

Web Service 2

Web Service 3

Screen Recording

Below is the link to the video recording:

Azure ML Capstone

Standout Suggestions

In this deployment I have enabled Application Insights, which helps with logging and monitoring of the web service.
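With SDK v1, Application Insights can also be switched on after deployment via `Webservice.update`. The service name below is a placeholder, not the project's actual deployment name:

```python
from azureml.core import Workspace
from azureml.core.webservice import Webservice

ws = Workspace.from_config()
# Hypothetical name of the deployed ACI service.
service = Webservice(ws, name="titanic-automl-service")

# Turn on Application Insights for request/response logging and monitoring.
service.update(enable_app_insights=True)
print(service.state)
```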

Future work

  • Do more work on data cleaning (e.g. the name column) and feature engineering to derive more valuable columns.
  • Work on converting the registered model to ONNX format.
  • Audit the models for overfitting and investigate measures to deal with the imbalanced dataset.