/titanic-survival-prediction

Titanic survival prediction with machine learning classifiers.

Contents

  1. Introduction
  2. Data
  3. Data preprocessing
  4. Model
  5. Evaluation
  6. Conclusion
  7. References

Introduction

This dataset is a subset of the Titanic dataset, which records the passengers aboard the Titanic. The Titanic was the largest passenger liner of its time and was carrying around 2,200 people when it sank in the North Atlantic Ocean in 1912. The ship is gone, but we can still learn a lot from its passenger records.

This problem is a classification problem. The goal is to predict whether a passenger survived the sinking of the Titanic.

To visualize the model, install Graphviz: brew install graphviz

Data

Data is provided by the Kaggle Titanic competition; download and unzip it. It consists of a training set of 891 entries and a test set of 418 entries, both in CSV format. The training set is used to train the model. The test set does not include the Survived column, so predictions on it are scored by submitting them to Kaggle.

Columns are: (PassengerId, Survived, Pclass, Name, ..., Fare)

The first column is the PassengerId, which is a unique identifier for each passenger.

Data preprocessing

Data preprocessing converts the raw data into a form that is suitable for training a machine learning model. The following steps are applied:

  • FamSize is created by combining SibSp and Parch.
  • Age is filled with the median value.
  • Embarked is filled with the most frequent value.
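
The three steps above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the function name preprocess and the sample rows are made up, and treating "combining" as a plain sum of SibSp and Parch is an assumption (some variants also add 1 to count the passenger).

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three preprocessing steps to a copy of df (hypothetical helper)."""
    out = df.copy()
    # FamSize: siblings/spouses plus parents/children (plain sum is an assumption).
    out["FamSize"] = out["SibSp"] + out["Parch"]
    # Missing ages are replaced by the median of the known ages.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # Missing ports are replaced by the most frequent port.
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode().iloc[0])
    return out


# Made-up rows for illustration only, not real passenger records.
sample = pd.DataFrame(
    {
        "SibSp": [1, 0, 3, 0],
        "Parch": [0, 0, 1, 2],
        "Age": [22.0, None, 30.0, 40.0],
        "Embarked": ["S", "C", None, "S"],
    }
)
clean = preprocess(sample)
print(clean)
```

Working on a copy keeps the raw DataFrame unchanged, so the same function can be applied to both the training and the test set.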

Model

Many machine learning models can be used for classification. This project uses a simple one, the decision tree, which predicts whether a passenger survived by applying a sequence of tests on the input features.
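
As a rough sketch of what training and visualizing such a tree can look like with scikit-learn: the feature rows, the 0/1 encoding of Sex, and the max_depth value below are assumptions chosen for illustration, not the settings used in the notebook.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Made-up rows: [Pclass, Sex (0 = male, 1 = female), FamSize].
feature_names = ["Pclass", "Sex", "FamSize"]
X = [
    [3, 0, 1],
    [1, 1, 1],
    [3, 1, 0],
    [1, 1, 1],
    [3, 0, 0],
    [2, 0, 0],
    [2, 1, 2],
    [3, 0, 4],
]
y = [0, 1, 1, 1, 0, 0, 1, 0]  # 1 = survived, 0 = did not survive

# A shallow tree is easier to read; max_depth=3 is an arbitrary choice.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X, y)

prediction = model.predict([[3, 1, 0]])

# With out_file=None, export_graphviz returns the tree as a DOT-format string.
dot_source = export_graphviz(
    model,
    out_file=None,
    feature_names=feature_names,
    class_names=["died", "survived"],
    filled=True,
)
print(dot_source)
```

Saving dot_source to a file such as tree.dot lets Graphviz render it, for example with dot -Tpng tree.dot -o tree.png.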

Used models:

Evaluation

Evaluation measures how well a model predicts. Accuracy is the fraction of predictions that are correct. Cross-validation estimates how the model performs on unseen data by repeatedly training on part of the training set and scoring on the held-out rest. A confusion matrix, which counts true and false positives and negatives, gives a more detailed view of the model's errors. The evaluation scores are:

  • F1 score: the harmonic mean of precision and recall.
  • Precision: the fraction of predicted positives that are true positives.
  • Recall: the fraction of actual positives that the model predicted as positive.
  • Accuracy: the fraction of all predictions that are correct.
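
To make these definitions concrete, here is a small pure-Python sketch that derives all four scores from confusion-matrix counts. The helper names and the label lists are made up for illustration. In practice a library such as scikit-learn provides equivalent functions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def scores(y_true, y_pred):
    """Compute precision, recall, F1 and accuracy; empty denominators give 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(y_true) if len(y_true) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up labels: 1 = survived, 0 = did not survive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
result = scores(y_true, y_pred)
print(confusion_counts(y_true, y_pred), result)
```

In this example there are 3 true positives, 1 false positive, 1 false negative and 3 true negatives, so all four scores come out to 0.75.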

Conclusion

Machine learning is a powerful tool for predicting the outcome of a problem. Maybe it will help prevent the next Titanic disaster, and maybe AI will one day save the world, unlike in The Terminator.