Dataset Used: titanic_dataset_csv(Kaggle)
- Preparation
- Examining the dataset, inspecting its columns and attributes.
- Inspecting the dataset for further analysis.
- Consolidate the analysis in a concise packaged data dictionary.
- Perform univariate analysis
- Perform bivariate analysis
- Perform multivariate analysis
- Data Cleansing
- Perform K means clustering
(NB: Feature Engineering is not performed in this project.)
The following libraries are imported for data preparation and subsequent manipulation.
- pandas, abbreviated as pd; for data analysis
- numpy, abbreviated as np; for numerical analysis and array manipulation
- matplotlib (its pyplot module), abbreviated as plt; for data visualization
- seaborn, abbreviated as sns; for statistical data visualization
- IPython.display; for displaying images
- sklearn.preprocessing; for normalizing and standardizing attributes
- sklearn.cluster; for clustering and implementing the K-Means algorithm
Before manipulating the dataset, it needs to be thoroughly examined. The .head() method gives an overview of the dataset, its attributes and variables.
The index is set to the column that uniquely identifies each row, using .set_index(col_name).
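As a minimal sketch of this step, the snippet below loads a few rows, previews them with .head() and sets the index. The inline CSV is a stand-in for the Kaggle file, and using PassengerId as the index column is an assumption, since the column is not named above.

```python
import io

import pandas as pd

# Stand-in for the Kaggle CSV (illustrative rows only); a real run would
# pass the path of the downloaded file to pd.read_csv instead.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.2833
3,1,3,female,26,7.925
"""

df = pd.read_csv(io.StringIO(csv_text))

# Overview of the first rows and the available columns
print(df.head())

# Assumption: PassengerId is the column that uniquely identifies each row
df = df.set_index("PassengerId")
```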
In this step, we examine the dataset.
DataFrame.shape gives the number of rows and columns, the info() method summarizes the column data types and non-null counts, and describe() reports summary statistics for the numeric columns: count, mean, standard deviation, min, quartiles and max.
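The three calls can be sketched as follows. The small DataFrame is made up for illustration and only mimics a few Titanic columns.

```python
import numpy as np
import pandas as pd

# Illustrative rows only, not the real dataset
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
})

# (number of rows, number of columns)
print(df.shape)

# Column names, dtypes and non-null counts
df.info()

# Summary statistics for the numeric columns
summary = df.describe()
print(summary)
```

Note that the count row of describe() excludes missing values, so it also hints at which columns have nulls.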
In this step, we look for null/missing values in the dataset. A heatmap generated with sns.heatmap(DataFrame.isnull()) identifies the missing values in each column. The proportion of missing data is plotted using sns.displot(data=DataFrame.isna()), which shows the distribution of the missing values.
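A hedged sketch of the missing-value check on made-up rows: the heatmap call is the one named above, while the per-column percentage computed with isnull().mean() is an added convenience and not necessarily how the project derives it.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Illustrative rows only, with gaps in Age and Cabin
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Percentage of missing values per column
missing_pct = df.isnull().mean() * 100
print(missing_pct)

# Each highlighted cell in the heatmap is a missing value
fig, ax = plt.subplots()
sns.heatmap(df.isnull(), cbar=False, ax=ax)
ax.set_title("Missing values per column")
```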
The data types, missing values and counts are packaged together in the data dictionary, which gives a clear and concise description of the variables. The numerical and categorical variables are also segregated.
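One possible way to build such a data dictionary is sketched below. The layout (one row per variable) and the names data_dict, numerical and categorical are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative rows only
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})

# One row per variable: data type, non-null count and missing count
data_dict = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "non_null_count": df.count(),
    "missing": df.isnull().sum(),
})
print(data_dict)

# Segregate numerical and categorical variables
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()
```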
Univariate analysis is the analysis of each variable independently. Here, the following analysis is performed:
- Getting the relative frequency of the Survived variable gives the percentage of passengers who survived and those who didn't. countplot() shows the count of observations for each group. The percentage composition is plotted in a pie chart using df['col_name'].value_counts().plot(kind='pie').
- The same process is followed for Pclass, Sex, SibSp, Parch, and Embarked.
- From the univariate analysis, more passengers died (61.62%) than survived (38.38%).
- Passengers with a lower socio-economic status (a lower ticket bracket), i.e. Pclass 3, outnumbered the other passengers on the ship.
- There were more men (64.76%) than women (35.24%) on the ship.
- Most passengers on the ship had zero siblings/spouses on board (68.2%).
- The percentage of passengers with no parents/children is the highest (76.1%).
- The highest number of embarkations is at Southampton (72.4%), followed by Cherbourg (18.9%), with Queenstown (8.7%) having the lowest.
- The plots for the variables Ticket and Cabin are unintelligible and require feature engineering.
- We get the distribution of 'Fare' using sns.distplot(df['Fare']); it is right-skewed, with most values between 0 and 50, and doesn't follow a normal distribution.
- Similarly, we get the distribution of Age using sns.distplot(df['Age']). It approximately follows a normal distribution.
Bivariate analysis examines the relationship between two variables at a time.
- A pivot table is created with 'Survived' as the index and the variables 'Age', 'SibSp', 'Parch', 'Fare' as values.
- This shows that younger people have a greater chance of survival.
- People who paid more for their tickets have a greater chance of survival.
- People with parents/children have a greater chance of survival.
- People with a sibling/spouse have a lesser chance of survival.
- Pclass is split into classes based on their Survival using
df.groupby()
- Similarly, 'SibSp', 'Parch', 'Sex', 'Embarked' are also split into categories based on their Survival. Bar graphs are plotted for analysis.
- Based on the analysis for Pclass and Survived, upper class passengers had the greatest chance of survival with a 60% survival percentage, the middle class had a 45% survival percentage, and the lower class had the lowest with a meagre ~23-24%. This indicates that the chance of survival is skewed towards passengers with a greater socio-economic status.
- The passengers on board with 3 parents/children had the greatest percentage of survival (~60%), followed closely by passengers with 2 and 1 parent(s)/child(ren). Passengers with 5 parents/children had the lowest percentage of survival, while passengers with 0 parents/children had a relatively low percentage of survival as well.
- Female passengers had a greater percentage of survival (75%) compared to male passengers (20%).
- The port of embarkation at Cherbourg had the highest survival percentage.
- For the distribution of Sex w.r.t. Fare, it indicates female passengers purchased tickets with a higher fare.
- For the distribution of Survived w.r.t. Age, people with a younger age had a slightly greater chance of survival.
- For the distribution of Pclass w.r.t. Age, upper class passengers are older than the passengers in the middle class and lower class socio-economic brackets. The lower class passengers have the youngest age bracket.
- For the distribution of Survived w.r.t. Fare, passengers who paid a higher fare have a greater chance of survival.
- Pivot tables comparing Survived and Pclass, Survived and Sex, and Survived and Embarked are created along with the count, using aggfunc='count', to further the aforementioned analysis.
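A hedged sketch of the bivariate steps on made-up rows: a pivot table of means indexed by Survived, a groupby survival rate per Pclass with a bar graph, and a count pivot table. Using Fare as the counted column is an arbitrary choice for illustration; any column without nulls would give the same counts.

```python
import pandas as pd

# Illustrative rows only
df = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1, 0],
    "Pclass": [3, 3, 1, 1, 2, 2],
    "Age": [30.0, 40.0, 22.0, 35.0, 28.0, 45.0],
    "Fare": [7.25, 8.05, 71.28, 53.10, 13.00, 10.50],
})

# Mean Age and Fare for each Survived value
means = df.pivot_table(index="Survived", values=["Age", "Fare"], aggfunc="mean")
print(means)

# Survival percentage per passenger class
rate_by_class = df.groupby("Pclass")["Survived"].mean() * 100
ax = rate_by_class.plot(kind="bar")
ax.set_ylabel("Survival %")

# Number of passengers for each Survived / Pclass combination
counts = df.pivot_table(
    index="Survived", columns="Pclass", values="Fare",
    aggfunc="count", fill_value=0,
)
print(counts)
```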
- Multivariate analysis is performed to identify correlations between variables.
- Parch (parents/children) and SibSp (siblings/spouses) have a strong positive correlation, furthering the analysis that families tend to stick together.
- Survived and Fare have a positive correlation, indicating that the ticket fare is associated with the chance of survival.
- There is a strong negative correlation between Survived and Pclass, Pclass and Age, Age and SibSp, and Fare and Pclass.
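The correlations can be inspected with a correlation matrix and heatmap, sketched below on made-up rows. Pearson correlation (the pandas default) and the colour map are assumptions, since the method is not stated above.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative rows only
df = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 0, 0],
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 15.0, 8.0, 7.0],
    "Sex": ["female", "female", "female", "male", "male", "male"],
})

# Pairwise Pearson correlation of the numeric columns only
corr = df.select_dtypes(include="number").corr()
print(corr)

# Annotated heatmap; values near -1 or 1 indicate strong correlation
fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
```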
- To handle the null values in the Age column, a boxplot of Age w.r.t. Survived is plotted and the mean Age of each group is obtained. This mean value is used to fill the null values.
- A function is defined that returns the mean age for each of the two conditions, and it is used to fill the missing values.
- Plotting the heatmap again confirms that no null values remain in the Age column.
- Since feature engineering is not performed here, Cabin is dropped.
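A minimal sketch of the cleansing steps, assuming the two conditions are the two values of Survived. The helper name impute_age and the rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative rows only
df = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 0, 1],
    "Age": [30.0, np.nan, 20.0, 24.0, 40.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan],
})

# Mean Age of each Survived group (missing values are ignored)
mean_age = df.groupby("Survived")["Age"].mean()

def impute_age(row):
    """Return the row's Age, or the mean Age of its Survived group if missing."""
    if pd.isnull(row["Age"]):
        return mean_age.loc[row["Survived"]]
    return row["Age"]

df["Age"] = df.apply(impute_age, axis=1)

# Cabin is dropped because no feature engineering is performed
df = df.drop(columns="Cabin")

# No nulls should remain in Age
print(df["Age"].isnull().sum())
```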
- Age column is moved to a new DataFrame.
- Since Age is already normalized, we scale it by fitting StandardScaler() to the Age column.
- K-Means assigns every data point to its nearest centroid.
- kmeans = KMeans(n_clusters=i) creates a model with i clusters and i centroids. kmeans.inertia_ gives the sum of squared distances between each data point and its centroid.
- The optimal number of clusters is selected using the within-cluster sum of squares (elbow) method. Here the number of clusters is 2.
- The shape of the centroid is used to form clusters.
- Then we perform an inverse transformation to get the clusters for Age.
- Histograms are plotted to display the distribution for the clusters of Age.
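The clustering steps can be sketched as follows on made-up ages. The range of cluster counts tried (1 to 5), n_init=10 and random_state=0 are assumptions chosen to make the sketch reproducible.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative ages only: one young group and one older group
ages = pd.DataFrame({"Age": [2, 4, 5, 8, 10, 12, 40, 45, 50, 55, 60, 65]})

# Standardize Age to zero mean and unit variance
scaler = StandardScaler()
scaled = scaler.fit_transform(ages[["Age"]].to_numpy())

# Within-cluster sum of squares (inertia_) for each candidate cluster count
wcss = []
for i in range(1, 6):
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=0)
    kmeans.fit(scaled)
    wcss.append(kmeans.inertia_)

# Fit the final model with the number of clusters chosen above (2)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
ages["Cluster"] = kmeans.fit_predict(scaled)

# Map the centroids back to the original Age scale
centers = scaler.inverse_transform(kmeans.cluster_centers_)
print(sorted(centers.ravel()))
```

Plotting wcss against the cluster count shows the elbow, and the Cluster column can be used to draw one Age histogram per cluster.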