# Check for null values in each column
titanic_df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
# Heatmap of null column values to get a better idea. Dark blue stands for null,
# and off-white means no null value for the respective column at each point.
sns.heatmap(titanic_df.isnull(), cmap="YlGnBu")
We can see that about 20% of Age values and the majority of Cabin values are null. The Embarked column also has 2 null values.
# Drop the Cabin column
titanic_df.drop('Cabin', axis=1, inplace=True)
# Drop the remaining nulls from the dataframe
titanic_df.dropna(inplace=True)
# Verifying that we don't have any more nulls.
# Notice we don't see any dark bars on the heatmap, so all nulls have been dropped.
sns.heatmap(titanic_df.isnull(), cmap="YlGnBu")
Most passengers who did not survive belonged to Class 3, i.e. the lowest class. Among those who survived, most belonged to Class 1, then Class 3, and then Class 2.
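The cell that produced this class-versus-survival plot isn't visible here; a minimal sketch, assuming a seaborn countplot and a small stand-in frame (the real call would pass titanic_df), would be:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs headless
import pandas as pd
import seaborn as sns

# Small synthetic stand-in for titanic_df
df = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 3, 2, 1, 1, 3, 3, 2],
})

# Count of passengers per survival outcome, split by passenger class
ax = sns.countplot(x="Survived", hue="Pclass", data=df)
```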
# What was the age distribution on the Titanic?
titanic_df['Age'].hist()
As we can see, the Titanic was populated mostly by younger people under the age of 30, which means lots of children and young adults.
What were the ages of the survivors?
Passengers aged 20-40 were the most likely to survive, followed by those below age 5. We would be inclined to think that children were more likely to survive, but ages 10-20 show a lower count. We need to ask ourselves whether that is because there were few passengers aged 10-20 to begin with.
(There were close to 40 passengers from ages 10-20 and close to 20 survived)
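The survivors' age histogram can be sketched by filtering on the Survived flag before calling hist; the frame below is a tiny stand-in for titanic_df:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs headless
import pandas as pd

# Synthetic stand-in for titanic_df
df = pd.DataFrame({
    "Age":      [22, 38, 26, 35, 4, 58, 19, 45],
    "Survived": [0, 1, 1, 1, 1, 0, 0, 1],
})

# Restrict to survivors and plot their age distribution
ax = df[df["Survived"] == 1]["Age"].hist()
```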
Analysing the fare distribution. How many people paid what sums of fare on the ship?
titanic_df['Fare'].hist(bins=20,color='y')
Most people who survived boarded at Southampton, but most people who did not survive also boarded at Southampton. It's safe to say the majority of the ship's passengers came from Southampton.
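The embarkation plot discussed here isn't shown as code; a minimal sketch, assuming a seaborn countplot of Embarked split by Survived on a small stand-in frame, would be:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs headless
import pandas as pd
import seaborn as sns

# Synthetic stand-in for titanic_df
df = pd.DataFrame({
    "Embarked": ["S", "S", "C", "Q", "S", "C", "S", "Q"],
    "Survived": [0, 1, 1, 0, 0, 1, 0, 0],
})

# Count of passengers per embarkation port, split by survival outcome
ax = sns.countplot(x="Embarked", hue="Survived", data=df)
```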
How do the Age values vary across the different passenger classes?
The mean age for Class 1 is higher than for Class 2, which in turn is higher than for Class 3. This is intuitive because richer people tend to be older; Class 1 fares are the most expensive, so that class has the highest mean age.
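The age-versus-class comparison can be sketched with a seaborn boxplot; the frame here is a small stand-in for titanic_df:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs headless
import pandas as pd
import seaborn as sns

# Synthetic stand-in for titanic_df
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 1],
    "Age":    [54, 40, 30, 27, 22, 19, 24, 58],
})

# Box-and-whisker summary of Age within each passenger class
ax = sns.boxplot(x="Pclass", y="Age", data=df)
```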
Putting it together, we can plot Survived/Not Survived for males and females by age group and fare paid.
0 - Not Survived
1 - Survived
Red circles are Male while Blue circle markers are Female
We need to convert string values into binary (0 or 1) values
# Converting the Embarked column into numerical binary values for Q, S and C. If both Q and S are 0,
# then the value is automatically C.
embarked = pd.get_dummies(titanic_df['Embarked'], drop_first=True)
# Converting the Passenger Class column into numerical binary values for 1, 2 and 3. If both 2 and 3 are 0,
# then the value is automatically Class 1.
pcl = pd.get_dummies(titanic_df['Pclass'], drop_first=True)
pcl.head()
# Converting the Sex column to a binary. If male = 0, then the value is automatically female.
sex = pd.get_dummies(titanic_df['Sex'], drop_first=True)
sex.head()
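The df_binary frame used below is not assembled in any visible cell; a plausible sketch, assuming the dummy columns above are concatenated onto titanic_df and only the numeric features plus the label are kept (mirroring the column list used for the test set later), would be:

```python
import pandas as pd

# Miniature stand-in for titanic_df after the null handling above
titanic_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass":   [3, 1, 3, 2],
    "Sex":      ["male", "female", "female", "male"],
    "SibSp":    [1, 1, 0, 0],
    "Parch":    [0, 0, 0, 0],
    "Fare":     [7.25, 71.28, 7.92, 13.0],
    "Embarked": ["S", "C", "Q", "S"],
})

embarked = pd.get_dummies(titanic_df["Embarked"], drop_first=True)
pcl = pd.get_dummies(titanic_df["Pclass"], drop_first=True)
sex = pd.get_dummies(titanic_df["Sex"], drop_first=True)

# Attach the dummy columns, then keep only the numeric features plus the label
df_binary = pd.concat([titanic_df, embarked, pcl, sex], axis=1)
df_binary = df_binary[["Survived", "SibSp", "Parch", "Fare", "Q", "S", 2, 3, "male"]]
```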
Performing machine learning on the prepared dataset:
# Assigning dependent and independent variables.
# The Survived column is our dependent variable; it is what we are trying to predict.
y = df_binary['Survived']
# The other columns are our independent variables, hence we drop the Survived column from the dataframe.
X = df_binary.drop('Survived', axis=1)
/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
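The split-and-fit cell itself isn't shown; a minimal sketch on synthetic data (the real notebook would use the X and y built from df_binary), passing an explicit solver to silence the FutureWarning, might look like:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix X and label vector y built above
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An explicit solver avoids the FutureWarning shown above
classifier = LogisticRegression(solver="lbfgs")
classifier.fit(X_train, y_train)

print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")
```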
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")
Training Data Score: 0.7978910369068541
Testing Data Score: 0.7482517482517482
# Converting the test set's Embarked column into numerical binary values for Q, S and C.
# If both Q and S are 0, then the value is automatically C.
embarked_test = pd.get_dummies(test_df['Embarked'], drop_first=True)
pcl_test = pd.get_dummies(test_df['Pclass'], drop_first=True)
sex_test = pd.get_dummies(test_df['Sex'], drop_first=True)
test_df=pd.concat([test_df, embarked_test, pcl_test, sex_test], axis=1)
test_df=test_df[["SibSp","Parch","Fare","Q","S",2,3,"male"]]
test_df.head()
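The final prediction step isn't shown; a self-contained sketch, with tiny stand-ins for the prepared training and test frames (the real notebook would fit on the X/y from df_binary and predict on the test_df assembled above), would be:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Column names are cast to strings here so scikit-learn's feature-name check
# is satisfied; the notebook keeps 2 and 3 as integer column names.
cols = ["SibSp", "Parch", "Fare", "Q", "S", "2", "3", "male"]
X = pd.DataFrame([[1, 0, 7.25, 0, 1, 0, 1, 1],
                  [1, 0, 71.3, 0, 0, 0, 0, 0],
                  [0, 0, 7.92, 0, 1, 0, 1, 0],
                  [0, 0, 13.0, 0, 1, 1, 0, 1]], columns=cols)
y = [0, 1, 1, 0]

classifier = LogisticRegression(solver="lbfgs").fit(X, y)

# Predict survival for the prepared test rows
test_rows = pd.DataFrame([[0, 0, 8.05, 1, 0, 0, 1, 1]], columns=cols)
predictions = classifier.predict(test_rows)
```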