Anton Fefilov, 17 September 2016
This is my first attempt at doing ML research on my own after Andrew Ng's awesome Coursera course on Machine Learning. As a basis for this work I took the excellent kernel Exploring the Titanic Dataset by Megan Risdal and replicated it in Python. I also drew a lot of useful material from the A Journey through Titanic kernel by Omar El Gabry.
There are three parts to my script as follows:
- Feature engineering
- Missing value imputation
- Prediction!
# Load packages
# pandas
import pandas as pd
from pandas import Series, DataFrame
# numpy, matplotlib, seaborn
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
import seaborn as sns
# configure seaborn
sns.set_style('whitegrid')
# draw graphics inline
%matplotlib inline
# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
# imputation
from fancyimpute import MICE
Now that our packages are loaded, let’s read in and take a peek at the data.
# get titanic & test csv files as a DataFrame
train_df = pd.read_csv("input/train.csv")
test_df = pd.read_csv("input/test.csv")
# preview the data structure
train_df.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# bind train and test data (sort columns alphabetically so the differing column sets align)
titanic_df = pd.concat([train_df, test_df], ignore_index=True, sort=True)
# get sense of variables and datatypes
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age 1046 non-null float64
Cabin 295 non-null object
Embarked 1307 non-null object
Fare 1308 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
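Since the train and test sets are now stacked in one frame, it's worth noting how to recover the original split when modeling time comes: Survived is non-null exactly for the training rows. A minimal sketch (the train_part/test_part names are just for illustration):
# Recover the original train/test split from the combined frame
train_part = titanic_df[titanic_df["Survived"].notnull()]
test_part = titanic_df[titanic_df["Survived"].isnull()]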
We’ve got a sense of our variables, their class type, and the first few observations of each. We know we’re working with 1309 observations of 12 variables. To make things a bit more explicit since a couple of the variable names aren’t 100% illuminating, here’s what we’ve got to deal with:
Variable Name | Description |
---|---|
PassengerId | Passenger's id |
Survived | Survived (1) or died (0) |
Pclass | Passenger’s class |
Name | Passenger’s name |
Sex | Passenger’s sex |
Age | Passenger’s age |
SibSp | Number of siblings/spouses aboard |
Parch | Number of parents/children aboard |
Ticket | Ticket number |
Fare | Fare |
Cabin | Cabin |
Embarked | Port of embarkation |
The first variable which catches my attention is passenger name, because we can break it down into additional meaningful variables which can feed predictions or be used to create new variables. For instance, passenger title is contained within the passenger name variable, and we can use surname to represent families. Let's do some feature engineering!
# Make a copy of the titanic data frame
title_df = titanic_df.copy()
# Grab title from passenger names: strip everything up to and including ", ",
# and everything from the first "." onward, leaving just the title
title_df["Name"] = title_df["Name"].replace(to_replace='(.*, )|(\\..*)', value='', regex=True)
title_df.head()
| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | NaN | S | 7.2500 | Mr | 0 | 1 | 3 | male | 1 | 0.0 | A/5 21171 |
1 | 38.0 | C85 | C | 71.2833 | Mrs | 0 | 2 | 1 | female | 1 | 1.0 | PC 17599 |
2 | 26.0 | NaN | S | 7.9250 | Miss | 0 | 3 | 3 | female | 0 | 1.0 | STON/O2. 3101282 |
3 | 35.0 | C123 | S | 53.1000 | Mrs | 0 | 4 | 1 | female | 1 | 1.0 | 113803 |
4 | 35.0 | NaN | S | 8.0500 | Mr | 0 | 5 | 3 | male | 0 | 0.0 | 373450 |
# Show title counts by sex
title_df.groupby(["Sex", "Name"]).size().unstack(fill_value=0)
Sex \ Name | Capt | Col | Don | Dona | Dr | Jonkheer | Lady | Major | Master | Miss | Mlle | Mme | Mr | Mrs | Ms | Rev | Sir | the Countess |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
female | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 260 | 2 | 1 | 0 | 197 | 2 | 0 | 0 | 1 |
male | 1 | 4 | 1 | 0 | 7 | 1 | 0 | 2 | 61 | 0 | 0 | 0 | 757 | 0 | 0 | 8 | 1 | 0 |
# Titles with very low cell counts to be combined into a "Rare title" level
# (restrict the replacement to the Name column so nothing else can be touched)
rare_titles = ['Dona', 'Lady', 'the Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer']
title_df["Name"] = title_df["Name"].replace(rare_titles, "Rare title")
# Also reassign Mlle, Ms, and Mme accordingly
title_df["Name"] = title_df["Name"].replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})
# Show title counts by sex
title_df.groupby(["Sex", "Name"]).size().unstack(fill_value=0)
Sex \ Name | Master | Miss | Mr | Mrs | Rare title |
---|---|---|---|---|---|
female | 0 | 264 | 0 | 198 | 4 |
male | 61 | 0 | 757 | 0 | 25 |
# Finally, grab surname from passenger name
uniq_surname_size = titanic_df["Name"].str.split(',').str.get(0).unique().size
print("We have %(uniq_surname_size)s unique surnames. I would be interested to infer ethnicity based on surname --- another time." % locals())
We have 875 unique surnames. I would be interested to infer ethnicity based on surname --- another time.
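We only counted the surnames above; if we wanted to keep them around as the basis for a family-level feature, storing them is a one-liner. A quick sketch (the Surname column is hypothetical and not used further in this script):
# Keep surnames in their own column for possible later use
titanic_df["Surname"] = titanic_df["Name"].str.split(',').str.get(0)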
Now that we’ve taken care of splitting passenger name into some new variables, we can take it a step further and make some new family variables. First we’re going to make a family size variable based on the number of siblings/spouse(s) (maybe someone has more than one spouse?) and the number of children/parents, plus one for the passenger themselves.
# Take just the columns we need from the titanic data frame
family_df = titanic_df.loc[:, ["Parch", "SibSp", "Survived"]].copy()
# Create a family size variable including the passenger themselves
family_df["Fsize"] = family_df.SibSp + family_df.Parch + 1
family_df.head()
| | Parch | SibSp | Survived | Fsize |
---|---|---|---|---|
0 | 0 | 1 | 0.0 | 2 |
1 | 0 | 1 | 1.0 | 2 |
2 | 0 | 0 | 1.0 | 1 |
3 | 0 | 1 | 1.0 | 2 |
4 | 0 | 0 | 0.0 | 1 |
# make figure wider
plt.figure(figsize=(15,5))
# visualize the relationship between family size & survival
sns.countplot(x='Fsize', hue="Survived", data=family_df)
(bar plot: survivor and casualty counts for each family size)
Ah hah. We can see that there’s a survival penalty to singletons and those with family sizes above 4. We can collapse this variable into three levels which will be helpful since there are comparatively fewer large families. Let’s create a discretized family size variable.
# Discretize family size; compute the boolean masks first, since ordered
# comparisons stop working once the column holds strings
singleton = family_df["Fsize"] == 1
small = (family_df["Fsize"] > 1) & (family_df["Fsize"] < 5)
large = family_df["Fsize"] > 4
family_df.loc[singleton, "Fsize"] = "singleton"
family_df.loc[small, "Fsize"] = "small"
family_df.loc[large, "Fsize"] = "large"
family_df.head(10)
| | Parch | SibSp | Survived | Fsize |
---|---|---|---|---|
0 | 0 | 1 | 0.0 | small |
1 | 0 | 1 | 1.0 | small |
2 | 0 | 0 | 1.0 | singleton |
3 | 0 | 1 | 1.0 | small |
4 | 0 | 0 | 0.0 | singleton |
5 | 0 | 0 | 0.0 | singleton |
6 | 0 | 0 | 0.0 | singleton |
7 | 1 | 3 | 0.0 | large |
8 | 2 | 0 | 1.0 | small |
9 | 0 | 1 | 1.0 | small |
# Show family size by survival using a mosaic plot
mosaic(family_df, ['Fsize', 'Survived'], title="Family size by survival")
(mosaic plot: proportions of survivors and casualties within each family size category)
The mosaic plot shows that we preserve our rule that there’s a survival penalty among singletons and large families, but a benefit for passengers in small families. I want to do something further with our age variable, but 263 rows have missing age values, so we will have to wait until after we address missingness.
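Before moving on, a quick crosstab puts numbers behind the mosaic's story: normalizing by row gives the survival rate within each family size level (normalize='index' assumes pandas >= 0.18.1; crosstab silently drops the test rows, whose Survived is NaN):
# Survival rate within each family size category (training rows only)
pd.crosstab(family_df["Fsize"], family_df["Survived"], normalize='index')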
What’s left? There’s probably some potentially useful information in the passenger cabin variable, including their deck. Let’s take a look.
# This variable appears to have a lot of missing values
titanic_df["Cabin"].head(28)
0 NaN
1 C85
2 NaN
3 C123
4 NaN
5 NaN
6 E46
7 NaN
8 NaN
9 NaN
10 G6
11 C103
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
21 D56
22 NaN
23 A6
24 NaN
25 NaN
26 NaN
27 C23 C25 C27
Name: Cabin, dtype: object
# The first character is the deck. For example:
list(titanic_df["Cabin"][1])
['C', '8', '5']
# The number of not-null Cabin values
titanic_df["Cabin"].count()
295
There’s more that likely could be done here, including looking into cabins with multiple rooms listed (e.g., row 27: “C23 C25 C27”), but given the sparseness of the column we’ll stop here.
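If we did come back to it, pulling out a deck variable is a one-liner; a sketch (the Deck column is hypothetical and not used below):
# The deck is the first character of the cabin code; NaN stays NaN
titanic_df["Deck"] = titanic_df["Cabin"].str[0]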
Now we’re ready to start exploring missing data and rectifying it through imputation. There are a number of different ways we could go about doing this. Given the small size of the dataset, we probably should not opt for deleting either entire observations (rows) or variables (columns) containing missing values. That leaves us with two options: replacing missing values with a sensible value given the distribution of the data, e.g., the mean, median or mode, or predicting them. We’ll use both of these methods, and I’ll rely on some data visualization to guide our decisions.
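First, a one-liner to see exactly where the gaps are:
# Count missing values per column
titanic_df.isnull().sum()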
# Let's find passengers with missing Embarked feature
np.where(titanic_df["Embarked"].isnull())[0]
array([ 61, 829])
titanic_df.loc[[61, 829]]
| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket |
---|---|---|---|---|---|---|---|---|---|---|---|---|
61 | 38.0 | B28 | NaN | 80.0 | Icard, Miss. Amelie | 0 | 62 | 1 | female | 0 | 1.0 | 113572 |
829 | 62.0 | B28 | NaN | 80.0 | Stone, Mrs. George Nelson (Martha Evelyn) | 0 | 830 | 1 | female | 0 | 1.0 | 113572 |
We will infer their values for embarkment based on present data that we can imagine may be relevant: passenger class and fare. We see that both paid $80 and both traveled first class. So from where did they embark?
# Exclude the two passengers with missing embarkment
embark_fare_df = titanic_df.drop([61,829], axis=0)
# Let's visualize embarkment, passenger class ...
sns.factorplot(x='Embarked', y='Fare', hue='Pclass', kind='box', order=['C', 'Q', 'S'], data=embark_fare_df, size=7, aspect=2)
# ... and a line at the $80 fare our two passengers paid
plt.axhline(y=80, color='r', ls='--')
(box plots: fare by port of embarkation, split by passenger class, with a dashed line at $80)
Voilà! The median fare for a first class passenger departing from Cherbourg (‘C’) coincides nicely with the $80 paid by our embarkment-deficient passengers. I think we can safely replace the NA values with ‘C’.
# Since their fare was $80 for 1st class, they most likely embarked from 'C'
titanic_df.loc[[61,829],"Embarked"] = 'C'
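A quick check confirms the Embarked gap is closed:
# Confirm no Embarked values remain missing
titanic_df["Embarked"].isnull().sum()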
We’re close to fixing the handful of NA values here and there. The passenger at row 1043 (PassengerId 1044) has an NA Fare value.
# Let's find passengers with missing Fare feature
np.where(titanic_df["Fare"].isnull())[0]
array([1043])
titanic_df.loc[[1043]]
| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1043 | 60.5 | NaN | S | NaN | Storey, Mr. Thomas | 0 | 1044 | 3 | male | 0 | NaN | 3701 |
This is a third class passenger who departed from Southampton (‘S’). Let’s visualize Fares among all others sharing their class and embarkment (n = 494).
# Fares of all other 3rd-class passengers who embarked at Southampton
same_group = titanic_df[(titanic_df['Pclass'] == 3) & (titanic_df['Embarked'] == 'S')]
facet = sns.FacetGrid(same_group, aspect=4)
facet.map(sns.kdeplot, 'Fare', shade=True)
facet.set(xlim=(0, 80))
fare_median = same_group['Fare'].median()
plt.axvline(x=fare_median, color='r', ls='--')
(density plot: fare distribution for this group, with the median marked by a dashed line)
From this visualization, it seems quite reasonable to replace the NA Fare value with the median for their class and embarkment, which is $8.05.
# Replace missing fare value with median fare for class/embarkment
titanic_df.loc[[1043],'Fare'] = fare_median
3.2 Predictive imputation
Finally, as we noted earlier, there are quite a few missing Age values in our data. We are going to get a bit more fancy here than a single constant: we’ll fill the gaps with random values drawn from around the observed mean (between mean - std and mean + std) and check that the overall shape of the Age distribution is preserved. A sketch of a genuinely model-based alternative with MICE follows at the end of this section.
# Show number of missing Age values
titanic_df["Age"].isnull().sum()
263
fig, (axis1,axis2) = plt.subplots(1,2,figsize=(15,4))
axis1.set_title('Original Age values - Titanic')
axis2.set_title('New Age values - Titanic')
# plot original Age values
# NOTE: drop all null values, and convert to int
titanic_df['Age'].dropna().astype(int).hist(bins=70, ax=axis1)
# get average, std, and number of NaN values
average_age = titanic_df["Age"].mean()
std_age = titanic_df["Age"].std()
count_nan_age = titanic_df["Age"].isnull().sum()
# generate random integers between (mean - std) & (mean + std)
rand_age = np.random.randint(int(average_age - std_age), int(average_age + std_age), size=count_nan_age)
# fill NaN values in Age column with random values generated
age_slice = titanic_df["Age"].copy()
age_slice[np.isnan(age_slice)] = rand_age
# plot imputed Age values
age_slice.astype(int).hist(bins=70, ax=axis2)
(histograms: original Age distribution on the left, distribution after imputation on the right)
Things look good, so let’s replace our Age vector in the original data with the new values.
titanic_df["Age"] = age_slice
# Show number of missing Age values
titanic_df["Age"].isnull().sum()
0
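For completeness, here is what the model-based route hinted at by our MICE import could have looked like, run in place of the random fill above. This is only a minimal sketch: it assumes the fancyimpute 0.x API, where MICE exposes a complete() method taking a numeric matrix and returning it with missing entries filled in, and the column list is just an illustrative choice:
# Model-based Age imputation with MICE (sketch; fancyimpute 0.x API assumed)
numeric_cols = ["Age", "Fare", "Parch", "Pclass", "SibSp"]
numeric_data = titanic_df[numeric_cols].values
# complete() returns a copy of the matrix with NaNs imputed from the other columns
completed = MICE().complete(numeric_data)
imputed_age = completed[:, 0]  # first column is Age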
We’ve finished imputing values for all variables that we care about for now! Now that we have a complete Age variable, there are just a few finishing touches I’d like to make. We can use Age to do just a bit more feature engineering …
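For instance, following Megan's original kernel, an obvious next step is a child indicator; a sketch (the Child column and the under-18 cutoff come from that kernel, not from anything above):
# Flag passengers under 18 as children
titanic_df["Child"] = np.where(titanic_df["Age"] < 18, "Child", "Adult")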