Titanic-Machine_Learning_from_Disaster

Use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

Author

Piyal Banik

Created on 24th July, 2021

Acknowledgements

Project Pipeline

Business Understanding
Analytical Approach
Data requirements
Data collection
Data Understanding
Data Preparation
Modeling
Evaluation

Libraries used

Numpy
Pandas
Seaborn
Scikit-Learn

  pip install numpy pandas seaborn scikit-learn

Business Understanding

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

The challenge: given a passenger's information, how can we predict whether they survived the Titanic disaster?

Analytical Approach

Our target variable is categorical (survived / not survived), and hence we need classification models for this task.

Data requirements

We need information about the passengers on board, such as name, age, fare, gender, and class.

Data collection

We are given two datasets, both CSV files. train.csv is used to train the model. test.csv contains observations without the survival information and is used to check whether the model can predict survival for unseen passengers.
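A minimal sketch of loading the two files with pandas. The helper name `load_titanic` is hypothetical, and the in-memory CSV rows are made up so the example runs without the real files; in practice you would pass the paths "train.csv" and "test.csv".

```python
import io

import pandas as pd


def load_titanic(train_source, test_source):
    """Read the training and test CSVs into DataFrames.

    Each argument may be a file path (e.g. "train.csv") or a file-like object.
    """
    train_df = pd.read_csv(train_source)
    test_df = pd.read_csv(test_source)
    return train_df, test_df


# Made-up stand-in rows; only the column layout mirrors the dataset.
_train_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Fare\n"
    "1,0,3,male,30,8.00\n"
    "2,1,1,female,40,70.00\n"
)
_test_csv = io.StringIO(
    "PassengerId,Pclass,Sex,Age,Fare\n"
    "3,3,male,35,8.00\n"
)
train_df, test_df = load_titanic(_train_csv, _test_csv)

# The test set carries every training column except the target.
print(set(train_df.columns) - set(test_df.columns))
```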

Data Understanding

This step is part of exploratory data analysis (EDA).

The training dataset has 891 observations and 12 columns: 11 predictor variables and 1 target variable.

There are a few different types of variables:

Continuous: Age, Fare
Discrete: SibSp, Parch
Categorical: Survived, Sex, and Embarked
Ordinal: Pclass
Mixed: Ticket
Alphanumeric: Cabin

Three features have missing values:

Cabin
Age
Embarked

Cabin has too many missing values, so it is better to drop it.
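One way to check how many values each column is missing is sketched below. The helper `missing_summary` is hypothetical, and the `toy` frame is made-up data used only to make the example runnable; on the real data you would pass the training DataFrame.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return the count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns with gaps, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up rows, not taken from the Titanic dataset.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 41.0],
    "Cabin": [np.nan, np.nan, np.nan, "C85"],
    "Embarked": ["S", "C", np.nan, "S"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})
print(missing_summary(toy))
```

A column whose missing percentage dominates, as Cabin does in this toy frame, is a candidate for dropping rather than imputing.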

In the training dataset, there were more males than females, and most passengers did not survive. However, females had a better survival rate than males. Passengers who survived had, on average, paid higher fares than those who did not. Passengers in Pclass = 1 also had a better survival rate than those in the other two classes.

The majority of passengers were between 15 and 35 years old, but most of them did not survive. Children under 4 and elderly passengers had a higher survival rate.
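Group-level survival rates like the ones described above can be computed with a groupby. This is a sketch only: `survival_rate_by` is a hypothetical helper, and the `toy` frame is made-up data, so its numbers say nothing about the real dataset.

```python
import pandas as pd


def survival_rate_by(df, column):
    """Mean of Survived (0/1) within each group of `column`, i.e. the survival rate."""
    return df.groupby(column)["Survived"].mean()


# Made-up rows, not taken from the Titanic dataset.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "male"],
    "Pclass": [3, 3, 1, 1, 2, 2],
    "Survived": [0, 0, 1, 1, 1, 0],
})
print(survival_rate_by(toy, "Sex"))
print(survival_rate_by(toy, "Pclass"))
```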

Data Preparation / Feature Engineering

After closely examining the dataset's variable types, values, and number of missing values, I decided to:

Impute the missing Age values
Turn age into an ordinal feature
Impute the missing Embarked values
Drop Cabin [too many missing values]
Drop Ticket [many duplicates]
Drop PassengerId, Name, SibSp, Parch [not helpful]
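The steps above can be sketched as a single function. Several details are assumptions, since the text does not specify them: the median is used to impute Age, the most frequent value is used to impute Embarked, and the age bin edges are illustrative. The helper name `prepare` and the `toy` rows are also made up.

```python
import numpy as np
import pandas as pd


def prepare(df):
    """Apply the listed preparation steps to a copy of `df`."""
    out = df.copy()
    # Assumption: impute Age with the median and Embarked with the mode.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # Assumption: illustrative bin edges; integer codes 0..4 form the ordinal scale.
    bins = [0, 16, 32, 48, 64, np.inf]
    out["Age"] = pd.cut(out["Age"], bins=bins, labels=False, include_lowest=True)
    drop_cols = ["Cabin", "Ticket", "PassengerId", "Name", "SibSp", "Parch"]
    # errors="ignore" skips any listed column that is absent from the frame.
    return out.drop(columns=drop_cols, errors="ignore")


# Made-up rows that follow the training file's column layout.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 2, 3],
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 4.0, 70.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 1, 0],
    "Ticket": ["T1", "T2", "T3", "T4"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", np.nan, "S", "C"],
})
prepared = prepare(toy)
print(prepared)
```

Applying the same function to the test set keeps both frames in the same shape, although in practice the imputation values would usually be computed on the training data only.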

Because a few categorical variables are present, I also performed some feature engineering and created dummy variables for:

Sex
Embarked
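Dummy variables for these two columns can be created with `pandas.get_dummies`, as in the sketch below. Using `drop_first=True` (to avoid a redundant column per feature) and `dtype=int` are assumptions, not choices stated in the text, and the `toy` rows are made up.

```python
import pandas as pd

# Made-up rows, not taken from the Titanic dataset.
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Assumption: drop the first level of each feature so the dummies are not redundant.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True, dtype=int)
print(encoded)
```

With `drop_first=True`, Sex becomes a single `Sex_male` column and Embarked becomes `Embarked_Q` and `Embarked_S`, with port C as the implicit baseline.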

Modeling

Models trained

Logistic Regression
k-Nearest Neighbors
Support Vector Machines
Naive Bayes classifier
Decision Tree
Random Forest
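A sketch of how the six model families listed above could be set up with scikit-learn. The hyperparameters, the choice of `GaussianNB` as the Naive Bayes variant, and the helper name `build_models` are all assumptions; the data is synthetic, generated with `make_classification` as a stand-in for the prepared Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models(random_state=0):
    """Return one untrained estimator per model family (illustrative settings)."""
    return {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
        "Support Vector Machines": SVC(),
        "Naive Bayes": GaussianNB(),  # assumption: Gaussian variant
        "Decision Tree": DecisionTreeClassifier(random_state=random_state),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=random_state),
    }


# Synthetic stand-in for the prepared feature matrix and target.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# fit() returns the estimator itself, so the dict holds the trained models.
fitted = {name: model.fit(X, y) for name, model in build_models().items()}
print(list(fitted))
```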

Evaluation

Decision Tree and Random Forest achieved the highest accuracy of the models trained, 93.03%.
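The text does not say which data the accuracy was measured on. The sketch below shows one common way to measure it, reporting accuracy on both the training split and a held-out split so the two can be compared. The 75/25 split, the helper name `evaluate`, and the synthetic data are assumptions for illustration, and the printed scores are unrelated to the figure above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit `model` and return its accuracy on the training and held-out splits."""
    model.fit(X_train, y_train)
    return {
        "train": accuracy_score(y_train, model.predict(X_train)),
        "test": accuracy_score(y_test, model.predict(X_test)),
    }


# Synthetic stand-in for the prepared feature matrix and target.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
scores = evaluate(DecisionTreeClassifier(random_state=0), X_train, X_test, y_train, y_test)
print(scores)
```

Tree-based models can fit their training data very closely, so the held-out score is usually the more informative of the two.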
