This repository contains a data analysis project that explores a dataset from a bank. The dataset, stored in a CSV file named 'bank_data.csv', contains customer-related information such as age, job, education, and financial details.
This data analysis project aims to provide insights into the bank dataset, exploring various aspects of the data such as customer demographics, financial information, and the response variable. The project includes data cleaning, handling missing values, outlier detection, and various visualizations to help understand the data better.
Before running the code in this project, make sure you have the following Python libraries installed:
- Pandas
- NumPy
- Matplotlib
- Seaborn
You can install the required Python libraries using pip:
pip install pandas numpy matplotlib seaborn
The data analysis process is broken down into several steps, as outlined below:
The project starts by importing necessary Python libraries and setting up the environment.
The dataset, stored in the 'bank_data.csv' file, is read into a Pandas DataFrame, and the first few rows are displayed to get an initial overview.
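As a minimal sketch of this loading step: the snippet below reads a tiny, made-up stand-in for 'bank_data.csv' from an in-memory string so that it runs on its own. The column names follow the ones mentioned in this README; the rows are invented for illustration, and the real file may need extra `read_csv` options.

```python
import io

import pandas as pd

# Made-up stand-in for bank_data.csv (illustrative rows only).
# With the real file, pass its path instead: pd.read_csv("bank_data.csv")
sample_csv = io.StringIO(
    "customerid,age,salary,balance,marital,jobedu,response\n"
    '1,58,100000,2143,married,"management,tertiary",no\n'
    '2,44,60000,29,single,"technician,secondary",no\n'
    '3,33,120000,2,married,"entrepreneur,secondary",yes\n'
)

df = pd.read_csv(sample_csv)

# First rows and overall shape give a quick initial overview
print(df.head())
print(df.shape)
```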
Data cleaning involves removing unwanted rows, columns, or values from the dataset to prepare it for analysis. In this project, some rows with missing or irrelevant data are dropped, and the 'jobedu' column is divided into separate 'job' and 'education' columns.
Unnecessary columns like 'customerid' are dropped to simplify the dataset.
A new 'Education' column is created by extracting values from the 'jobedu' column.
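The cleaning steps above could look roughly like this. It is a sketch on toy data, and it assumes that 'jobedu' stores the job and the education level separated by a comma; the README does not state the separator, so adjust it to match the real file.

```python
import pandas as pd

# Toy rows for illustration; values are made up
df = pd.DataFrame({
    "customerid": [1, 2, 3],
    "age": [58, 44, 33],
    "jobedu": [
        "management,tertiary",
        "technician,secondary",
        "entrepreneur,secondary",
    ],
})

# Assumption: 'jobedu' has the form "<job>,<education>"
parts = df["jobedu"].str.split(",", n=1, expand=True)
df["job"] = parts[0].str.strip()
df["education"] = parts[1].str.strip()

# Drop the identifier and the now-redundant combined column
df = df.drop(columns=["customerid", "jobedu"])
print(df)
```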
Missing values in the 'age' and 'month' columns are identified and handled appropriately. In the 'pdays' column, missing values are replaced with NaN.
Duplicate records based on the 'age' and 'response' columns are identified.
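A short sketch of this check with pandas' `DataFrame.duplicated`, using made-up rows. Passing `keep=False` flags every member of a duplicate group rather than only the later repeats.

```python
import pandas as pd

# Toy rows for illustration; values are made up
df = pd.DataFrame({
    "age": [58, 44, 58, 33],
    "response": ["no", "no", "no", "yes"],
})

# True for every row whose (age, response) pair occurs more than once
dup_mask = df.duplicated(subset=["age", "response"], keep=False)
duplicates = df[dup_mask]
print(duplicates)
```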
Outliers in numerical variables like 'age', 'salary', and 'balance' are analyzed using boxplots and quantiles.
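The quantile side of this step might be sketched as follows. The data is made up, and `iqr_outliers` is a hypothetical helper introduced here; it applies the common 1.5 × IQR rule, which is roughly what the whiskers of a default boxplot show.

```python
import pandas as pd

# Toy rows for illustration; values are made up
df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50],
    "balance": [100, 200, 150, 300, 250, 50000],
})

# Upper quantiles show how far the tail stretches beyond the bulk
print(df.quantile([0.5, 0.75, 0.9, 0.95, 0.99]))


def iqr_outliers(series: pd.Series) -> pd.Series:
    """Return values outside 1.5 * IQR from the quartiles (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]


print(iqr_outliers(df["balance"]))
```

Whether a flagged value is an error or a genuine extreme (for example, a very high balance) is a judgment call, so the sketch only reports the values and does not remove them.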
The 'duration' variable is standardized to ensure uniformity.
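The README does not say what standardizing 'duration' involves. One possible reading, sketched below, is that durations are stored as text in mixed units and are converted to a single unit. The string format ("261 sec", "2.5 min") and the 'duration_min' column name are assumptions made for illustration.

```python
import pandas as pd

# Toy rows for illustration; the "<number> <unit>" format is an assumption
df = pd.DataFrame({"duration": ["261 sec", "2.5 min", "90 sec"]})

parts = df["duration"].str.split(" ", n=1, expand=True)
value = parts[0].astype(float)
unit = parts[1].str.strip()

# Keep minute values as they are; convert everything else from seconds
df["duration_min"] = value.where(unit == "min", value / 60)
print(df)
```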
Univariate analysis explores categorical features like 'marital', 'job', 'education', 'poutcome', and the target variable 'response'. Visualizations such as bar plots and pie charts provide insights.
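A sketch of the univariate step on toy data: category shares are computed with `value_counts(normalize=True)` and then drawn as a bar plot and a pie chart. The rows are made up, and the non-interactive "Agg" backend is selected only so the snippet runs as a plain script; in a notebook that line can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; optional in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows for illustration; values are made up
df = pd.DataFrame({
    "marital": ["married", "single", "married", "divorced"],
    "response": ["no", "no", "no", "yes"],
})

# Share of customers in each category
marital_share = df["marital"].value_counts(normalize=True)
response_share = df["response"].value_counts(normalize=True)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
marital_share.plot.barh(ax=axes[0], title="marital (share of customers)")
response_share.plot.pie(ax=axes[1], autopct="%.0f%%", title="response")
fig.tight_layout()
# Call plt.show() or fig.savefig(...) to view the figure
```

Looking at shares rather than raw counts makes an imbalanced target variable easy to spot.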
Bivariate analysis examines relationships between variables, including numerical-numerical, categorical-numerical, and categorical-categorical relationships. Correlation analysis, boxplots, and heatmaps are used to visualize these relationships.
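The three kinds of relationships can be sketched with plain pandas on toy data. The rows are made up, and 'response_flag' is a hypothetical helper column that encodes 'response' as 0/1 so that a mean can be read as a response rate.

```python
import pandas as pd

# Toy rows for illustration; values are made up
df = pd.DataFrame({
    "age": [30, 40, 50, 60],
    "salary": [20000, 50000, 60000, 100000],
    "balance": [100, 500, 400, 2000],
    "marital": ["single", "married", "married", "single"],
    "education": ["secondary", "tertiary", "secondary", "tertiary"],
    "response": ["no", "yes", "no", "yes"],
})

# Numerical vs numerical: pairwise correlation matrix
corr = df[["age", "salary", "balance"]].corr()

# Categorical vs numerical: compare a numeric summary across groups
median_salary = df.groupby("response")["salary"].median()

# Categorical vs categorical: response rate per education/marital pair
df["response_flag"] = (df["response"] == "yes").astype(int)
rate_table = pd.pivot_table(
    df,
    index="education",
    columns="marital",
    values="response_flag",
    aggfunc="mean",
)

print(corr)
print(median_salary)
print(rate_table)
```

Both `corr` and `rate_table` are two-dimensional tables, so either one can be passed to seaborn's `heatmap` function to get the heatmap view mentioned above.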
This data analysis project provides a comprehensive exploration of the bank dataset, covering data cleaning, missing value handling, outlier detection, and various visualizations. The findings and insights gained from this analysis can be valuable for making informed decisions and building predictive models.
Contributions to this project are welcome. If you have suggestions, improvements, or additional analyses to add, please feel free to contribute.
This project is licensed under the MIT License - see the LICENSE.md file for details.