Completed by Mangaliso Makhoba.
Overview: This project uses the Titanic dataset to build a simple statistical model that returns the conditional survival probability of a passenger, given a condition on a numerical variable from the dataset.
Problem Statement: Build a model that returns a passenger's survival chance given that passenger's details as input.
Data: Titanic Kaggle Challenge
Deliverables: Probability
- Statistical Modeling
- Imputation of Missing values
- Probability
- Scikit-learn
- Jupyter Notebook
Ensure that the following packages have been installed and imported.
pip install numpy
pip install pandas
Follow the instructions at https://docs.anaconda.com/anaconda/install/ to install Anaconda with Jupyter.
Alternatively, VS Code can render Jupyter notebooks.
The structure of this notebook is as follows:
- First, we'll load our data to get a view of the predictor and response variables we will be modeling.
- We determine the number of missing values for a specific column
- We'll then preprocess our data by imputing missing values: the mean for numerical features and the mode for categorical features.
- We then model the survival probability of a passenger given their age, class, gender and so on.
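The imputation step in the list above could look roughly like the sketch below. The helper name impute_missing and the toy DataFrame are hypothetical and only for illustration; the notebook may build its cleaned DataFrame differently. When a categorical column has several modes, this sketch simply picks the first one, which is an assumption.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def impute_missing(df):
    """Return a copy of df with missing values filled in.

    Hypothetical helper: numerical columns get their mean,
    all other columns get their (first) mode.
    """
    df_filled = df.copy()
    for name in df_filled.columns:
        column = df_filled[name]
        if not column.isna().any():
            continue  # nothing to impute in this column
        if is_numeric_dtype(column):
            fill_value = column.mean()
        else:
            modes = column.mode()  # mode() ignores missing values by default
            if modes.empty:
                continue  # column is entirely missing, leave it as-is
            fill_value = modes.iloc[0]  # assumption: take the first mode
        df_filled[name] = column.fillna(fill_value)
    return df_filled


# Made-up data, not the Titanic dataset
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", None, "S"],
})
toy_clean = impute_missing(toy)
print(toy_clean)
```

The function returns a new DataFrame and leaves the input untouched, so the raw data stays available for comparison.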
Write a function that determines the number of missing entries for a specified column in the dataset. The function should return an int that corresponds to the number of missing entries in the specified column.
Function Specifications:
- Should take a pandas DataFrame and a column_name as input and return an int as output.
- The int should be the number of missing entries in the column.
- Should be generalised to be able to work on ANY dataframe.
Expected Outputs:
total_missing(df,'Age') == 177
total_missing(df,'Survived') == 0
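One possible way to meet this specification is sketched below. It is not necessarily the notebook's own solution, and the toy DataFrame is made up, so its counts differ from the expected outputs for the Titanic data.

```python
import numpy as np
import pandas as pd


def total_missing(df, column_name):
    """Return the number of missing entries in df[column_name] as an int."""
    # isna() marks NaN/None entries; summing the booleans counts them.
    return int(df[column_name].isna().sum())


# Made-up data, not the Titanic dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Survived": [0, 1, 1, 0],
})
print(total_missing(toy, "Age"))
print(total_missing(toy, "Survived"))
```

The explicit int() conversion is there because pandas returns a NumPy integer from sum(), while the specification asks for an int.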
Write a function that takes a dataframe and a column name as input, and returns the mean for numerical columns and the mode for non-numerical columns.
Function Specifications:
- The function should take two inputs: (df, column_name), where df is a pandas DataFrame and column_name is a str.
- If the column_name does not exist in df, raise a ValueError.
- Should return as output the mean if the specified column is numerical and return a list of the mode(s) otherwise.
- The mean should be rounded to 2 decimal places.
- If there is more than one mode for a given non-numerical column, the function should return a list of all modes.
Expected Outputs:
calc_mean_mode(df, 'Age') == 29.7
calc_mean_mode(df, 'Embarked') == ['S']
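A minimal sketch of a function that follows this specification is shown below. It uses pandas' is_numeric_dtype to decide between the two branches, which is one reasonable choice rather than the required one. The toy DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def calc_mean_mode(df, column_name):
    """Return the rounded mean of a numerical column, or a list of modes otherwise."""
    if column_name not in df.columns:
        raise ValueError(f"Column '{column_name}' does not exist in the DataFrame.")
    column = df[column_name]
    if is_numeric_dtype(column):
        # mean() skips missing values by default
        return round(float(column.mean()), 2)
    # mode() returns every tied mode, sorted, and ignores missing values
    return column.mode().tolist()


# Made-up data, not the Titanic dataset
toy = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 41.0, np.nan],
    "Embarked": ["S", "C", "S", "C", None],
})
print(calc_mean_mode(toy, "Age"))
print(calc_mean_mode(toy, "Embarked"))
```

In the toy data "S" and "C" are tied, so the second call returns both modes in a list.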
We ultimately want to predict the survival chances of the passengers in the testing set. We can start by building a simple model from the data we already have, using conditional probability. Write a function that returns the survival probability of a passenger, given a condition on a numerical variable from the dataset. The condition will consist of a column_name, a value and a boolean_operator. Possible boolean operators include "<", ">", or "==". For example, column_name = "Age", boolean_operator = ">", and value = 40 together form the condition Age > 40.
Function specifications:
- The function should make use of the df_clean DataFrame loaded earlier in this notebook.
- It should take a numerical column_name string, a boolean_operator string, and a value of type string as input.
- It should return a survival likelihood as a number between 0 and 1, rounded to 2 decimal places.
- Assume that column_name exists in df_clean.
Expected Outputs:
survival_likelihood(df_clean,"Pclass","==","3") == 0.24
survival_likelihood(df_clean,"Age","<","15") == 0.58
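The conditional probability described above is the share of survivors among the passengers that satisfy the condition. A minimal sketch follows, assuming the outcome column is named Survived and holds 0/1 values as in the Kaggle data. Returning NaN when no passenger matches the condition is a choice made here, not part of the specification, and the toy DataFrame stands in for df_clean.

```python
import operator

import pandas as pd

# Map the allowed operator strings to their comparison functions
_OPERATORS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def survival_likelihood(df, column_name, boolean_operator, value):
    """Return P(Survived = 1 | column_name <operator> value), rounded to 2 decimals."""
    if boolean_operator not in _OPERATORS:
        raise ValueError(f"Unsupported operator: {boolean_operator}")
    # value arrives as a string, so convert it before comparing
    mask = _OPERATORS[boolean_operator](df[column_name], float(value))
    matching = df.loc[mask, "Survived"]
    if matching.empty:
        return float("nan")  # assumption: no matching passengers, no estimate
    # The mean of a 0/1 column is the fraction of survivors
    return round(float(matching.mean()), 2)


# Made-up stand-in for df_clean, not the Titanic dataset
toy_clean = pd.DataFrame({
    "Age": [10, 12, 30, 45, 50],
    "Pclass": [3, 3, 1, 2, 3],
    "Survived": [1, 0, 1, 0, 0],
})
print(survival_likelihood(toy_clean, "Age", "<", "15"))
print(survival_likelihood(toy_clean, "Pclass", "==", "3"))
```

Using a lookup table of operator functions avoids calling eval on the input strings.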
Finding an appropriate strategy for imputing missing values is important for improving the accuracy of the model you are building.
Authors: Mangaliso Makhoba, Explore Data Science Academy
Contact: makhoba808@gmail.com
This project is complete.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.