I will go through the whole process of creating a machine learning model on the famous Titanic dataset, which is used by many people all over the world. It provides information on the fate of the passengers on the Titanic. Using it, I will predict whether a given passenger on the Titanic would have survived or not.
Variable | Definition | Key | Variable Notes |
---|---|---|---|
Survival | Survival | 0 = No, 1 = Yes | |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd | 1st = Upper, 2nd = Middle, 3rd = Lower |
sex | Sex | | |
Age | Age in years | | Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5 |
sibsp | # of siblings / spouses aboard the Titanic | | The dataset defines family relations |
parch | # of parents / children aboard the Titanic | | The dataset defines family relations |
ticket | Ticket number | | |
fare | Passenger fare | | |
cabin | Cabin number | | |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton | |
Find out how strongly each feature influences Survived.
- Null Data Checking
- Pclass
- Sex
- Sex and Pclass
- Age
- Age and Pclass
- Pclass, Sex and Age
- Embarked
- Family
- Fare
- Name Replacing
- Fill Null in Age
- Fill Null in Embarked
- Change Initial, Embarked and Sex (string to numerical)
- One-hot encoding on Initial and Embarked
- Drop columns
- Build Model and predict using the trained model
- Feature Importance
- RandomForestClassifier
Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
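As a rough sketch of how the steps listed above could fit together, the snippet below checks for nulls, fills them, converts strings to numbers, one-hot encodes Embarked, drops a column, and then trains a RandomForestClassifier and reads its feature importances. Several things here are assumptions for illustration only: the eight rows are made-up toy data (not the real Titanic dataset), Age is filled with the median and Embarked with the most frequent port, only Ticket is dropped, and `n_estimators=100` and `random_state=0` are arbitrary choices.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy rows invented for illustration; NOT taken from the real Titanic data.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Pclass":   [3, 1, 3, 3, 1, 2, 3, 2],
    "Sex":      ["male", "female", "female", "male",
                 "female", "male", "male", "female"],
    "Age":      [22.0, 38.0, None, 35.0, 54.0, None, 2.0, 27.0],
    "Fare":     [7.25, 71.28, 7.92, 8.05, 51.86, 13.0, 21.07, 11.13],
    "Embarked": ["S", "C", "S", None, "S", "Q", "S", "C"],
    "Ticket":   ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"],
})

# Null data checking: count missing values per column.
null_counts = df.isnull().sum()

# Fill nulls (assumed strategy: median age, most frequent port).
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Change Sex from string to numerical.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# One-hot encoding on Embarked.
df = pd.get_dummies(df, columns=["Embarked"], prefix="Embarked")

# Drop columns that are not used as features (only Ticket in this toy frame).
df = df.drop(columns=["Ticket"])

# Build the model and predict using the trained model.
X = df.drop(columns=["Survived"])
y = df["Survived"]
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
predictions = model.predict(X)

# Feature importance: one score per input column, highest first.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Predicting on the training rows, as done here, only shows the mechanics; to estimate real accuracy, the data would normally be split into separate training and validation sets first.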