# Check for null values in each column
titanic_df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
# Heatmap of null column values to get a better idea. Dark blue stands for null,
# and off-white means no null value for the respective column at each point.
sns.heatmap(titanic_df.isnull(), cmap="YlGnBu")
We can see that about 20% of Age values and the majority of Cabin values are null. The Embarked column also has 2 null values.
# Drop the Cabin column
titanic_df.drop('Cabin', axis=1, inplace=True)
# Drop the remaining nulls from the dataframe
titanic_df.dropna(inplace=True)
# Verifying that we don't have any more nulls.
# Notice we don't see any dark bars on the heatmap, so all nulls have been dropped.
sns.heatmap(titanic_df.isnull(), cmap="YlGnBu")
Most passengers who did not survive belonged to Class 3, i.e. the lowest class. Among those who survived, most belonged to Class 1, then Class 3, and then Class 2.
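The cell that produced this class-versus-survival plot isn't visible here; a minimal sketch, assuming a seaborn countplot and a small stand-in frame (the real call would pass titanic_df), would be:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs headless
import pandas as pd
import seaborn as sns

# Small synthetic stand-in for titanic_df
df = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 3, 2, 1, 1, 3, 3, 2],
})

# Count of passengers per survival outcome, split by passenger class
ax = sns.countplot(x="Survived", hue="Pclass", data=df)
```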
# What was the age distribution on the Titanic?
titanic_df['Age'].hist()
As we can see, the Titanic was populated mostly by younger people under the age of 30, which means lots of children and young adults.
What were the ages of the survivors?
Passengers aged 20-40 were the most likely to survive, followed by those below age 5. We would be inclined to think that children were more likely to survive, but ages 10-20 show a lower count. We need to ask ourselves whether that is because there were few passengers aged 10-20 to begin with.
(There were close to 40 passengers from ages 10-20 and close to 20 survived)
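The survivors' age histogram can be sketched by filtering on the Survived flag before calling hist; the frame below is a tiny stand-in for titanic_df:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs headless
import pandas as pd

# Synthetic stand-in for titanic_df
df = pd.DataFrame({
    "Age":      [22, 38, 26, 35, 4, 58, 19, 45],
    "Survived": [0, 1, 1, 1, 1, 0, 0, 1],
})

# Restrict to survivors and plot their age distribution
ax = df[df["Survived"] == 1]["Age"].hist()
```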
Analysing the fare distribution. How many people paid what sums of fare on the ship?
titanic_df['Fare'].hist(bins=20,color='y')
Most people who survived boarded at Southampton, but most people who did not survive also boarded at Southampton. It's safe to say the majority of the ship's passengers came from Southampton.
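The embarkation plot discussed here isn't shown as code; a minimal sketch, assuming a seaborn countplot of Embarked split by Survived on a small stand-in frame, would be:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs headless
import pandas as pd
import seaborn as sns

# Synthetic stand-in for titanic_df
df = pd.DataFrame({
    "Embarked": ["S", "S", "C", "Q", "S", "C", "S", "Q"],
    "Survived": [0, 1, 1, 0, 0, 1, 0, 0],
})

# Count of passengers per embarkation port, split by survival outcome
ax = sns.countplot(x="Embarked", hue="Survived", data=df)
```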
How do the Age values vary across the different passenger classes?
The mean age for Class 1 is higher than for Class 2, which in turn is higher than for Class 3. This is intuitive because richer people tend to be older; Class 1 fares are the most expensive, so that class has the highest mean age.
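The age-versus-class comparison can be sketched with a seaborn boxplot; the frame here is a small stand-in for titanic_df:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs headless
import pandas as pd
import seaborn as sns

# Synthetic stand-in for titanic_df
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 1],
    "Age":    [54, 40, 30, 27, 22, 19, 24, 58],
})

# Box-and-whisker summary of Age within each passenger class
ax = sns.boxplot(x="Pclass", y="Age", data=df)
```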
Putting it together, we can plot Survived/Not Survived for males and females by age group and fare paid.
0 - Not Survived
1 - Survived
Red circles are Male while Blue circle markers are Female
We need to convert string values into binary (0 or 1) values
# Converting the Embarked column into numerical binary values for Q, S and C. If both Q and S are 0,
# then the value is automatically C.
embarked = pd.get_dummies(titanic_df['Embarked'], drop_first=True)
# Converting the Passenger Class column into numerical binary values for 1, 2 and 3. If both 2 and 3 are 0,
# then the value is automatically Class 1.
pcl = pd.get_dummies(titanic_df['Pclass'], drop_first=True)
pcl.head()
# Converting the Sex column to a binary. If male = 0, then the value is automatically female.
sex = pd.get_dummies(titanic_df['Sex'], drop_first=True)
sex.head()
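The df_binary frame used below is not assembled in any visible cell; a plausible sketch, assuming the dummy columns above are concatenated onto titanic_df and only the numeric features plus the label are kept (mirroring the column list used for the test set later), would be:

```python
import pandas as pd

# Miniature stand-in for titanic_df after the null handling above
titanic_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass":   [3, 1, 3, 2],
    "Sex":      ["male", "female", "female", "male"],
    "SibSp":    [1, 1, 0, 0],
    "Parch":    [0, 0, 0, 0],
    "Fare":     [7.25, 71.28, 7.92, 13.0],
    "Embarked": ["S", "C", "Q", "S"],
})

embarked = pd.get_dummies(titanic_df["Embarked"], drop_first=True)
pcl = pd.get_dummies(titanic_df["Pclass"], drop_first=True)
sex = pd.get_dummies(titanic_df["Sex"], drop_first=True)

# Attach the dummy columns, then keep only the numeric features plus the label
df_binary = pd.concat([titanic_df, embarked, pcl, sex], axis=1)
df_binary = df_binary[["Survived", "SibSp", "Parch", "Fare", "Q", "S", 2, 3, "male"]]
```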
Performing machine learning on the prepared dataset:
# Assigning dependent and independent variables.
# The Survived column is our dependent variable; it is what we are trying to predict.
y = df_binary['Survived']
# The other columns are our independent variables, hence we drop the Survived column from the dataframe.
X = df_binary.drop('Survived', axis=1)
/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
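The split-and-fit cell itself isn't shown; a minimal sketch on synthetic data (the real notebook would use the X and y built from df_binary), passing an explicit solver to silence the FutureWarning, might look like:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix X and label vector y built above
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An explicit solver avoids the FutureWarning shown above
classifier = LogisticRegression(solver="lbfgs")
classifier.fit(X_train, y_train)

print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")
```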
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")
Training Data Score: 0.7978910369068541
Testing Data Score: 0.7482517482517482
# Converting the test set's Embarked column into numerical binary values for Q, S and C.
# If both Q and S are 0, then the value is automatically C.
embarked_test = pd.get_dummies(test_df['Embarked'], drop_first=True)
pcl_test = pd.get_dummies(test_df['Pclass'], drop_first=True)
sex_test = pd.get_dummies(test_df['Sex'], drop_first=True)
test_df=pd.concat([test_df, embarked_test, pcl_test, sex_test], axis=1)
test_df=test_df[["SibSp","Parch","Fare","Q","S",2,3,"male"]]
test_df.head()
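The final prediction step isn't shown; a self-contained sketch, with tiny stand-ins for the prepared training and test frames (the real notebook would fit on the X/y from df_binary and predict on the test_df assembled above), would be:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Column names are cast to strings here so scikit-learn's feature-name check
# is satisfied; the notebook keeps 2 and 3 as integer column names.
cols = ["SibSp", "Parch", "Fare", "Q", "S", "2", "3", "male"]
X = pd.DataFrame([[1, 0, 7.25, 0, 1, 0, 1, 1],
                  [1, 0, 71.3, 0, 0, 0, 0, 0],
                  [0, 0, 7.92, 0, 1, 0, 1, 0],
                  [0, 0, 13.0, 0, 1, 1, 0, 1]], columns=cols)
y = [0, 1, 1, 0]

classifier = LogisticRegression(solver="lbfgs").fit(X, y)

# Predict survival for the prepared test rows
test_rows = pd.DataFrame([[0, 0, 8.05, 1, 0, 0, 1, 1]], columns=cols)
predictions = classifier.predict(test_rows)
```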