This project involves the Exploratory Data Analysis (EDA) of an airline dataset. The dataset contains various details about passengers, flights, airports, and flight statuses. Below is a detailed walkthrough of the analysis performed.
- Importing Libraries
- Loading the Dataset
- Initial Data Exploration
- Data Cleaning
- Exploratory Data Analysis
## Importing Libraries

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import plotly.express as px
import folium
from folium.plugins import HeatMap
import warnings
warnings.filterwarnings('ignore')
```

## Loading the Dataset
```python
import os

# List every file available under the Kaggle input directory.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

df = pd.read_csv('/kaggle/input/airline-dataset/Airline Dataset Updated - v2.csv')
df.head(10)
```

## Initial Data Exploration

- The dataset contains 98,619 entries and 15 columns.
- Checking the datatypes and null values:

```python
df.info()
df.isnull().sum()
```

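Raw null counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of such a summary; the helper name `missing_summary` and the toy frame are hypothetical and only stand in for the airline data.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing-value counts and percentages, worst first."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)

# Hypothetical toy frame, not taken from the real dataset.
sample = pd.DataFrame({
    "Age": [34, np.nan, 51, 27],
    "Gender": ["Male", "Female", None, "Female"],
    "Nationality": ["Japan", "Brazil", "Kenya", "Peru"],
})

summary = missing_summary(sample)
print(summary)
```

On the real data, the same call would be `missing_summary(df)`.
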
## Data Cleaning

- Dropping irrelevant columns (`First Name`, `Last Name`, `Passenger ID`):

```python
df = df.drop(['First Name', 'Last Name', 'Passenger ID'], axis=1)
df.head(5)
```

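In a notebook, running a `drop` cell a second time raises a `KeyError` because the columns are already gone. One way around this, sketched below on a hypothetical toy frame, is to pass `errors="ignore"` so that missing labels are skipped.

```python
import pandas as pd

# Hypothetical toy frame; the column names mirror the identifier columns above.
sample = pd.DataFrame({
    "Passenger ID": ["A1", "B2"],
    "First Name": ["Ana", "Ben"],
    "Last Name": ["Lopez", "Khan"],
    "Age": [34, 51],
})

id_columns = ["First Name", "Last Name", "Passenger ID"]

# errors="ignore" skips labels that no longer exist, so a re-run is harmless.
cleaned = sample.drop(columns=id_columns, errors="ignore")
cleaned_again = cleaned.drop(columns=id_columns, errors="ignore")

print(list(cleaned.columns))
```

The trade-off is that a misspelled column name is also silently ignored, so the stricter default may be preferable in a script that runs only once.
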
## Exploratory Data Analysis

- Unique values in the `Gender` column:

```python
df['Gender'].unique()
```

- Count of each gender:

```python
# Name both columns explicitly so the result is the same on every pandas version.
data = df['Gender'].value_counts().rename_axis('Gender').reset_index(name='count')
```

- Visualization:

```python
fig = px.bar(data, x='Gender', y='count', color='Gender', color_discrete_sequence=px.colors.sequential.Agsunset, template='plotly_dark')
fig.update_layout(title_text='Number of Males & Females', xaxis_title='GENDER', yaxis_title='COUNT')
fig.show()
```

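The column names produced by `value_counts().reset_index()` depend on the pandas version: before 2.0 the result has the columns `index` and the original column name, while from 2.0 on it has the original column name and `count`. A small helper that fixes the names avoids this ambiguity in plotting code. The sketch below is one possible approach; `count_values` is a hypothetical name and the sample data is made up.

```python
import pandas as pd

def count_values(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Count occurrences of each value, returning columns [column, 'count']."""
    counts = frame[column].value_counts()
    # rename_axis names the index, reset_index(name=...) names the values,
    # so the output does not depend on the pandas version in use.
    return counts.rename_axis(column).reset_index(name="count")

# Hypothetical sample data.
sample = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male", "Female"]})

gender_counts = count_values(sample, "Gender")
print(gender_counts)
```

The resulting frame can be passed to `px.bar` with `x='Gender'` and `y='count'`.
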
- KDE plot for age distribution:

```python
sns.kdeplot(data=df, x='Age', hue='Gender')
plt.show()
```

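A density plot shows the shape of the age distribution but not its exact statistics. A numeric summary per gender can complement it. The sketch below uses a hypothetical toy frame with the same `Gender` and `Age` column names.

```python
import pandas as pd

# Hypothetical sample data, not taken from the real dataset.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "Age": [22, 30, 40, 60, 50, 44],
})

# One row per gender with the count, mean, median and standard deviation of Age.
age_stats = sample.groupby("Gender")["Age"].agg(["count", "mean", "median", "std"])
print(age_stats)
```

If the two KDE curves overlap almost completely, the means and medians in this table should be close as well.
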
- Unique nationalities and their count:

```python
df['Nationality'].unique()
df['Nationality'].nunique()
```

- Top 10 nationalities:

```python
# Name both columns explicitly so the result is the same on every pandas version.
nation_count = df['Nationality'].value_counts().rename_axis('Nationality').reset_index(name='count')
top_10_countries = nation_count.nlargest(10, 'count')
px.bar(top_10_countries, x='Nationality', y='count', color='Nationality', color_discrete_sequence=px.colors.sequential.Agsunset, template='plotly_dark')
```

- Lowest 10 nationalities:

```python
lowest_10_countries = nation_count.nsmallest(10, 'count')
px.bar(lowest_10_countries, x='Nationality', y='count', color='Nationality', color_discrete_sequence=px.colors.sequential.Agsunset, template='plotly_dark')
```

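At the low end of a frequency table, many categories often share the same count, so which ten rows `nsmallest` returns can depend on row order alone. The `keep` parameter of `nlargest` and `nsmallest` makes this visible. The sketch below uses made-up counts to show the difference.

```python
import pandas as pd

# Hypothetical counts; three countries tie for the lowest value.
nation_count = pd.DataFrame({
    "Nationality": ["China", "Indonesia", "Russia", "Tuvalu", "Nauru", "Palau"],
    "count": [500, 300, 200, 1, 1, 1],
})

# Default keep="first": exactly n rows, ties broken by row order.
lowest_2 = nation_count.nsmallest(2, "count")

# keep="all": every row tied with the n-th smallest value is kept too.
lowest_2_with_ties = nation_count.nsmallest(2, "count", keep="all")

print(len(lowest_2), len(lowest_2_with_ties))
```

If the tied result is much longer than requested, the "lowest 10" chart is better read as a sample of the least frequent categories than as a ranking.
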
- Unique airports and their count:

```python
df['Airport Name'].unique()
df['Airport Name'].nunique()
```

- Top 10 airports with the highest number of passengers:

```python
# Name both columns explicitly so the result is the same on every pandas version.
airport_name = df['Airport Name'].value_counts().rename_axis('Airport Name').reset_index(name='count')
top10_airport = airport_name.nlargest(10, 'count')
px.bar(top10_airport, x='Airport Name', y='count', color='Airport Name', color_discrete_sequence=px.colors.sequential.Agsunset, template='plotly_dark')
```

- Bottom 10 airports with the lowest number of passengers:

```python
bottom10_airport = airport_name.nsmallest(10, 'count')
px.bar(bottom10_airport, x='Airport Name', y='count', color='Airport Name', color_discrete_sequence=px.colors.sequential.Agsunset, template='plotly_dark')
```

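Absolute counts do not show how concentrated the traffic is. Expressing each airport's count as a share of all passenger records can add that context. The sketch below uses hypothetical airport labels, and `top_n` is an illustrative parameter.

```python
import pandas as pd

# Hypothetical passenger records; the airport labels are placeholders.
sample = pd.DataFrame({
    "Airport Name": ["Airport A", "Airport A", "Airport A", "Airport B",
                     "Airport B", "Airport C", "Airport D", "Airport E"],
})

# value_counts sorts in descending order, so head(top_n) gives the busiest airports.
counts = sample["Airport Name"].value_counts()
share = (counts / counts.sum() * 100).round(1)

top_n = 2
top_share = share.head(top_n).sum()
print(f"Top {top_n} airports account for {top_share}% of records")
```

A small combined share for the top airports suggests that passengers are spread fairly evenly across the network.
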
This exploratory data analysis provides insights into the demographics and travel patterns of airline passengers. The visualizations help in understanding the distribution of genders, ages, nationalities, and the most and least frequented airports.