Python_EDA

This repository covers the implementation of exploratory data analysis (EDA). A total of 21 EDA case studies have been implemented using Malaysian datasets.

Primary Language: Jupyter Notebook


The webinar titled "Exploratory Data Analysis Techniques in Python" is designed to provide participants with a comprehensive understanding of how to effectively analyze data using Python. Scheduled for Sunday, July 21, 2024, from 03:00 pm to 05:00 pm (MYT), this online event is organized by ISS-Nigeria and UTMI. The session will be led by Assoc Prof Dr. Shahizan Othman, a renowned expert in data analysis, who will guide attendees through various techniques and methodologies essential for exploratory data analysis.

Participants will learn practical applications and best practices for using Python libraries such as pandas, within Jupyter Notebooks, to manipulate, clean, and visualize data. This webinar is ideal for both beginners and experienced professionals looking to enhance their data analysis skills. With a focus on hands-on learning, attendees will gain valuable insights that can be applied to real-world data science projects.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves examining and summarizing a dataset to understand its characteristics, identify patterns, and gain insights into the data. EDA is typically performed before more advanced statistical and machine learning techniques are applied and helps in forming hypotheses, selecting appropriate modeling approaches, and ensuring data quality. Here are some key components and techniques used in EDA:

  1. Data Summary: Begin by understanding the basic information about the dataset, such as the number of rows and columns, data types, missing values, and summary statistics (mean, median, standard deviation, etc.).

  2. Data Visualization: Visualizing data through plots and charts can provide a clearer understanding of its distribution and patterns. Common types of visualizations include histograms, box plots, scatter plots, and bar charts.

  3. Data Distribution: Analyze the distribution of variables to determine whether they follow normal, uniform, or other types of distributions. This can impact the choice of statistical tests and modeling techniques.

  4. Correlation Analysis: Explore the relationships between variables using correlation matrices, scatter plots, and other correlation measures. This helps identify potential dependencies and multicollinearity.

  5. Outlier Detection: Identify and handle outliers in the data. Outliers can significantly affect statistical measures and model performance.

  6. Categorical Variables: Examine the distribution of categorical variables through frequency tables, bar plots, and pie charts. This helps understand the composition of categorical data.

  7. Data Transformation: Apply transformations (e.g., log transformation, standardization) to make the data more suitable for analysis, especially if it doesn't meet assumptions of statistical methods.

  8. Feature Engineering: Create new variables or features that might be more informative or relevant for the analysis. This could involve aggregating, combining, or extracting information from existing variables.

  9. Missing Data Handling: Deal with missing data, either by imputing missing values or excluding incomplete records. The choice of method depends on the nature of the data and the problem at hand.

  10. Hypothesis Testing: If relevant, perform hypothesis tests to determine whether observed differences or relationships in the data are statistically significant.

  11. Encoding and Scaling: Consider scaling numerical variables or encoding categorical variables for modeling. This can include one-hot encoding, label encoding, or other techniques.

  12. Dimensionality Reduction: Use techniques like Principal Component Analysis (PCA) or t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce the dimensionality of the data while preserving important information.

  13. Time Series Analysis: For time series data, analyze trends, seasonality, and autocorrelation patterns. Techniques like autocorrelation plots and decomposition can be helpful.

  14. Geospatial Analysis: When dealing with geographic data, use maps, geospatial plots, and spatial statistics to understand spatial patterns and relationships.

  15. Text Analysis: If the dataset contains text data, perform text mining and sentiment analysis to extract insights from the textual content.
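
To make a few of these steps concrete, here is a minimal sketch covering data summary, visualization, correlation analysis, and outlier detection with pandas, Matplotlib, and Seaborn. The file name data.csv and the choice of the first numeric column are placeholders for your own data:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load a dataset (data.csv is a placeholder for your own file)
df = pd.read_csv("data.csv")

# Data summary: shape, data types, missing values, and summary statistics
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())
print(df.describe())

# Data visualization: distribution of the first numeric column
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols[0]].hist(bins=30)
plt.title(f"Distribution of {numeric_cols[0]}")
plt.show()

# Correlation analysis: heatmap of pairwise correlations
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap="coolwarm")
plt.show()

# Outlier detection with the IQR rule
q1, q3 = df[numeric_cols[0]].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df[numeric_cols[0]] < q1 - 1.5 * iqr) | (df[numeric_cols[0]] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers in {numeric_cols[0]}")
```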

EDA is an iterative process, and the specific techniques and tools used can vary depending on the nature of the data and the objectives of the analysis. It plays a crucial role in gaining an initial understanding of the data, guiding subsequent analysis, and making informed decisions about the next steps in a data science or analytical project.
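
As a further example, the dimensionality reduction step (item 12 above) can be sketched with scikit-learn's PCA. This again assumes a numeric dataset loaded from a placeholder file data.csv:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Keep only numeric columns and drop incomplete rows (data.csv is a placeholder)
df = pd.read_csv("data.csv")
X = df.select_dtypes(include="number").dropna()

# Standardize first so every feature contributes on the same scale
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain roughly 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("Original dimensions:", X.shape[1])
print("Reduced dimensions:", X_reduced.shape[1])
print("Explained variance ratio:", pca.explained_variance_ratio_)
```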

Why is EDA so important in data science?

✅️ The main purpose of EDA is to help you look at the data before making any assumptions. In addition to better understanding the patterns in the data or detecting unusual events, it also helps you find interesting relationships between variables.

✅️ Data scientists can use exploratory analysis to ensure that the results they produce are valid and relevant to desired business outcomes and goals.

✅️ EDA also helps stakeholders by verifying that they are asking the right questions.

✅️ EDA can help to answer questions about standard deviations, categorical variables, and confidence intervals.

✅️ Once the exploratory analysis is complete and initial findings are established, its outputs can feed into more complex data analysis or modeling, including machine learning.

Python

👉 Python is a popular programming language for data science, with several libraries and tools that are commonly used for EDA, such as:

  1. Pandas: a library for data manipulation and analysis.
  2. NumPy: a library for numerical computing in Python.
  3. Scikit-learn: a machine learning library that also includes tools for data preprocessing, feature selection, and dimensionality reduction, all of which are useful during EDA.
  4. Matplotlib: a plotting library for creating visualizations.
  5. Seaborn: a higher-level visualization library built on top of Matplotlib.
  6. Plotly: an interactive data visualization library.

In EDA, you might perform tasks such as cleaning the data, handling missing values, transforming variables, generating summary statistics, creating visualizations (e.g. histograms, scatter plots, box plots), and identifying outliers. All of these tasks can be done using the above libraries in Python.
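
Here is a short sketch of two of these tasks (filling missing values and applying a log transformation) using a small hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values and a skewed numeric column
df = pd.DataFrame({
    "age": [22, 35, None, 41, 29],
    "income": [2500, 12000, 4300, None, 150000],
})

# Handle missing values: fill age with the median, drop rows missing income
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["income"])

# Transform the skewed variable with a log transformation
df["log_income"] = np.log1p(df["income"])

print(df.describe())
```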

📖 Notes

Basic Concept

Exercise

| Exercise | Objective | Description |
| --- | --- | --- |
| 1. Introduction to Google Colab | Familiarize yourself with Google Colab | Create a new notebook, write a simple Python script to print "Hello, World!", and explore basic features like adding text cells, running code cells, and saving your notebook. |
| 2. Loading Data with Pandas | Learn how to load datasets into pandas DataFrames | Download a sample dataset (e.g., the Titanic dataset from Kaggle), upload it to Google Colab, and load it into a pandas DataFrame. Display the first few rows using the head() method. |
| 3. Data Cleaning and Preprocessing | Understand how to clean and preprocess data | Identify and handle missing values in the dataset. Use methods like dropna() to remove missing values or fillna() to fill them with appropriate values. Convert data types if necessary. |
| 4. EDA - Descriptive Statistics | Perform basic descriptive statistics to understand the dataset | Use pandas methods like describe(), mean(), median(), and std() to calculate summary statistics for numerical columns. Create frequency tables for categorical columns. |
| 5. Data Visualization with Matplotlib and Seaborn | Visualize data to uncover patterns and insights | Create various plots such as histograms, box plots, and scatter plots using Matplotlib and Seaborn. For example, visualize the distribution of ages in the Titanic dataset and explore relationships between different features. |
| 6. Correlation Analysis | Analyze correlations between different features | Calculate the correlation matrix using the corr() method in pandas. Visualize the correlation matrix using a heatmap in Seaborn to identify strongly correlated features. |
| 7. Identifying Outliers | Detect and handle outliers in the dataset | Use statistical methods and visualizations like box plots to identify outliers in numerical data. Explore techniques to handle outliers, such as removing them or transforming the data. |
| 8. Exercise: Marketing | Complete all steps with the Marketing dataset | |
| 9. Exercise: Titanic Dataset | Complete all steps with the Titanic dataset | |
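
A minimal sketch of exercises 2 to 4 is shown below. It assumes the Kaggle Titanic CSV has been uploaded to the Colab session as titanic.csv; the column names (Age, Cabin, Sex) follow the common Kaggle version and may differ in your copy:

```python
import pandas as pd

# Exercise 2: load the uploaded CSV into a DataFrame and preview it
df = pd.read_csv("titanic.csv")
print(df.head())

# Exercise 3: basic cleaning; fill missing ages with the median and
# drop the sparsely populated Cabin column if it exists
df["Age"] = df["Age"].fillna(df["Age"].median())
df = df.drop(columns=["Cabin"], errors="ignore")

# Exercise 4: descriptive statistics and a frequency table
print(df.describe())
print(df["Sex"].value_counts())
```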

🌟 Case Study

| Team | Title | Colab | GitHub |
| --- | --- | --- | --- |
| 404 Error | Property in Kuala Lumpur | Open in Colab | Open in GitHub |
| Alrite | The Exportation of Plantation in Sarawak | Open in Colab | Open in GitHub |
| BEFE | Covid-19 Clusters in Malaysia | Open in Colab | Open in GitHub |
| Boboiboy | Property Listings in Kuala Lumpur | Open in Colab | Open in GitHub |
| COLBY | Malaysia GE-14 Result | Open in Colab | Open in GitHub |
| FANTOM | Daily recorded COVID-19 cases at state level in Malaysia | Open in Colab | Open in GitHub |
| HAHA | Foreign Direct Investment in Malaysia | Open in Colab | Open in GitHub |
| HD | Guna Tanah Tampin 2021 | Open in Colab | Open in GitHub |
| KIA | Malaysia State Election 2018 | Open in Colab | Open in GitHub |
| LAB | Malaysia Air Pollution Analysis | Open in Colab | Open in GitHub |
| MAAM | Malaysia Hospital Patient Movement Analysis | Open in Colab | Open in GitHub |
| MEOW | Capacity and utilisation of Intensive Care Unit (ICU) beds during COVID-19 | Open in Colab | Open in GitHub |
| MM | Malaysia's 14th State Election Result | Open in Colab | Open in GitHub |
| PIXALATED | Number of deaths in Malaysia from 2001 to 2018 | Open in Colab | Open in GitHub |
| POTATO | Deaths by state, sex and age group, Malaysia 2001-2018 | Open in Colab | Open in GitHub |
| QnX | Real Estate Kuala Lumpur Malaysia | Open in Colab | Open in GitHub |
| SAMVERSE | Restaurant Rating in Malaysia | Open in Colab | Open in GitHub |
| SMOL | Population in Malaysia from 2010-2019 | Open in Colab | Open in GitHub |
| SQ | Number of Cases and Incidence Rate of Communicable Disease by State | Open in Colab | Open in GitHub |
| TUK | Number of Government School Pupils by District Education Office and State 2017-2018 | Open in Colab | Open in GitHub |
| UWU | Property Listings in Kuala Lumpur | Open in Colab | Open in GitHub |

Automated EDA Tools

EDA is a vital but time-consuming task in a data project. Here are 10 open-source tools that generate an EDA report in seconds.

| Library | Description | Web | GitHub |
| --- | --- | --- | --- |
| SweetViz | In-depth EDA report in two lines of code. Covers information about missing values, data statistics, etc. Creates a variety of data visualizations. Integrates with Jupyter Notebook. | 🌐 | :octocat: |
| Pandas-Profiling | Generates a high-level EDA report of your data in no time. Covers info about missing values, data statistics, correlation, etc. Produces data alerts. Plots data feature interactions. | 🌐 | :octocat: |
| DataPrep | Supports Pandas and Dask DataFrames. Interactive visualizations. 10x faster than Pandas-based tools. Covers info about missing values, data statistics, correlation, etc. Plots data feature interactions. | 🌐 | :octocat: |
| AutoViz | Supports CSV, TXT, and JSON. Interactive Bokeh charts. Covers info about missing values, data statistics, correlation, etc. Presents data cleaning suggestions. | 🌐 | :octocat: |
| D-Tale | Runs common Pandas operations with no code. Exports code of the analysis. Covers info about missing values, data statistics, correlation, etc. Highlights duplicates, outliers, etc. Integrates with Jupyter Notebook. | 🌐 | :octocat: |
| dabl | Primarily provides visualizations. Covers a wide range of plots: scatter pair plots, histograms, target distribution. | 🌐 | :octocat: |
| QuickDA | Gets an overview report of the dataset. Covers info about missing values, data statistics, correlation, etc. Produces data alerts. Plots data feature interactions. | 🌐 | :octocat: |
| Datatile | Extends Pandas describe(). Provides column stats: column type count, missing values, column datatype. Mostly statistical information. | 🌐 | :octocat: |
| Lux | Provides visualization recommendations. Supports EDA on a subset of columns. Integrates with Jupyter Notebook. Exports code of the analysis. | 🌐 | :octocat: |
| ExploriPy | Performs statistical testing. Column type-wise distribution: continuous, categorical. Covers info about missing values, data statistics, correlation, etc. | 🌐 | :octocat: |
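
As an illustration, a report from one of these tools (SweetViz) can be generated in a couple of lines, assuming the library has been installed with pip install sweetviz and that data.csv is a placeholder for your own dataset:

```python
import pandas as pd
import sweetviz as sv

# Load any tabular dataset (data.csv is a placeholder)
df = pd.read_csv("data.csv")

# Generate and save an interactive HTML EDA report
report = sv.analyze(df)
report.show_html("sweetviz_report.html")
```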

Contribution 🛠️

Please create an Issue for any improvements, suggestions, or errors in the content.

You can also contact me on LinkedIn for any other queries or feedback.
