In this project, I discuss useful techniques to clean a messy dataset with Python and Pandas. I discuss principles of tidy data
and signs of an untidy data
. I discuss EDA and present ways to deal with outliers
and missing and negative numerical values
. I discuss how to check for missing values with ASSERT
statement. I present how to reshape data using the Pandas melt()
function.
===============================================================================
This project is divided into various sections which are listed below:-
-
Introduction to Python data cleaning
-
Tidy data format
-
Signs of an untidy dataset
-
Python data cleansing – prerequisites
-
Import the required Python libraries
-
The source dataset
-
Exploratory data analysis (EDA)
-
Visual exploratory data analysis (Visual EDA)
-
Findings of EDA and Visual EDA
-
Split the ‘age_sex’ column into two separate columns
-
Reorder the column labels
-
Dealing with negative numerical values
-
Dealing with outliers
-
Dealing with missing numerical values
-
Check with ASSERT statement
-
Reshaping the data into tidy data format
-
Summary
-
References
===============================================================================
Whenever we have to work with a real world dataset, the first problem that we face is to clean it. The real world dataset never comes clean. It consists lot of discrepancies in the dataset. So, we have to clean the dataset for further processing.
Cleaning data is the process of preparing the dataset for analysis. It is very important because the accuracy of machine learning or data mining models are affected because of poor quality of data.
So, data scientists spend a large amount of their time cleaning the dataset and transform them into a format with which they can work with. In fact, data scientists spend 80% of their time cleaning the data.
A very common scenario is that the dataset contains missing values coded as NaN
. Also, the missing values are coded in different ways. The dataset may contain negative or invalid values. It may contain outliers. It may be in the untidy format. All of these are examples of a messy dataset.
In this project, I present several useful ways to handle these discrepancies in the dataset.
===============================================================================
Data comes in a wide variety of shapes and formats. Hadley Wickham, the Chief Scientist at RStudio, write a paper about tidy data in 2014 that formalizes the shape of the data. So, it gives us a goal when formatting the data.
He states in his paper that –
Tidy data provides a standard way to organize data values within a dataset.
There are three principles of tidy data. These are as follows:-
• Columns represent separate variables
• Rows represent individual observations
• Observational units form tables.
Tidy data makes it easier to fix common data problems. So, we need to transform the untidy dataset into tidy data.
Before we look into the details of cleaning the dataset, we have to understand what constitutes an untidy data. We need to diagnose our data and find common signs of a messy dataset.
===============================================================================
We have to take a closer look to find common signs of a messy dataset. These common signs are as follows:-
• Missing numerical data
Missing numerical data needs to be identified and addressed. Either they need to be deleted or replaced with a suitable test statistic.
• Untidy data
Untidy dataset can contain multiple problems. They prevent us from transforming the messy dataset into a clean dataset that is suitable for analysis.
• Unexpected data values
Mismatched data types of a column and data values can cause potential problems. They need to be investigated and solved.
• Inconsistent column names
Column names contain inconsistent capitalizations and bad characters. They need to be addressed properly.
• Outliers
Outliers need to be detected. They pose potential problems needs to be investigated and removed.
• Duplicate rows and columns
Duplicate rows and columns make data redundant. They can bias an analysis. Hence, they needs to be found and dropped.
===============================================================================
We need three Python libraries for the data cleaning process – NumPy, Pandas and Matplotlib.
• NumPy – NumPy is the fundamental Python library for scientific computing. It adds support for large and multi-dimensional arrays and matrices. It also supports large collection of high-level mathematical functions to operate on these arrays.
• Pandas - Pandas is a software library for Python programming language which provide tools for data manipulation and analysis tasks. It will enable us to manipulate numerical tables and time series using data structures and operations.
• Matplotlib - Matplotlib is the core data visualization library of Python programming language. It provides an object-oriented API for embedding plots into applications.
===============================================================================
We have seen that we need three Python libraries – NumPy, Pandas and Matplotlib for the data cleansing process. We need to import these libraries before we actually start using them. We can import them with their usual shorthand notation as follows:-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
===============================================================================
For this project, I have created a fictitious dataset. The dataset consists of details of my facebook friends.
===============================================================================
Now, it is time to understand the data. We should diagnose the data for any discrepancies by doing exploratory data analysis. We should proceed as follows:-
• df.shape attribute
We can check the dimensions of the data with df.shape attribute.
• df.head() and df.tail() methods
We can view the top five and bottom five rows of the dataset with df.head() and df.tail() methods respectively.
• df.info() method
We can get a concise summary of the dataset with df.info() method. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.
• df.dtypes attribute
We can check the data types of each column in the dataframe with df.dtypes attribute. The above command returns the data type of each column.
• df.describe() method
We can view the summary statistics of numerical columns with df.describe() method. It enable us to detect outliers in the data which require further investigation.
• df.columns attribute
We can get the column labels of the dataframe with df.columns attribute.
===============================================================================
Now, we should conduct data visualization to find discrepancies in the data. Data visualization is a great way to find errors in the data and detect outliers. They help us to detect patterns in the data.
We can use various types of plots for data visualization purpose. These plots are listed below:-
• Bar plot
A bar plot is a plot that presents data with rectangular bars with lengths proportional to the values that they represent.
• Histograms
We use histograms for plotting continuous data counts. A histogram is a representation of the distribution of data.
In this case, we use histograms for plotting distribution of data values of height (cm) and weight (kg) columns.
• Box plot
We can visualize basic summary statistics with box plot. Box plot let us to detect outliers in the data. They help us to find minimum and maximum values. They present 25th, 50th, 75th percentiles. 50th percentile value is the median value.
• Scatter plot
Scatter plot help us to explore relationship between two numeric variables. It help us to identify potentially bad data. We should draw a scatter plot of height(cm) and weight(kg) column.
===============================================================================
We can summarize the findings of EDA and visual EDA as follows:-
-
The dataset has 10 rows and 10 columns.
-
The age and sex columns are combined together. There should be two separate columns of age and sex.
-
All the invalid values (the values coded as "xx") and missing values in height (cm), weight (kg), spend_A, spend_B and spend_C columns are coded as "NaN". The use of the keyword errors='coerce' in pd.to_numeric() method enable us to convert all the invalid values into NaN.
-
The data types of columns height (cm), weight(kg) and spend_C columns are converted into float64.
-
In the height (cm) column, there is a value of 0.0. It is not possible as height cannot be 0. So, we need to resolve it.
-
In the weight (kg) column, there is a negative value of -60 and a very high value of 160. Both are invalid values. Hence, we need to solve the issue.
-
The three columns spend_A, spend_B and spend_C denote spending at three supermarkets A,B and C. These columns must contain positive real numbers. The missing values in these columns denote nothing spend in that market. We need to handle these missing values properly.
-
In the spend_B column, there is a negative value -100. The amount spent cannot be negative. So, we need to take care of that.
===============================================================================
We should split the 'age_sex' column into two separate columns.
We can do this using the df.str.split() function as follows:-
df[['age','sex']] = df.age_sex.str.split("_", expand = True)
Now, there is no need for the age_sex column. So, we should drop that column.
We can drop 'age_sex' column using the df.drop() method as follows:-
df.drop(['age_sex'], axis=1)
===============================================================================
We should reorder the columns for more pleasing visual appearance.
We can do it as follows:-
df = df[['fname','lname','age','sex','section','height(cm)','weight(kg)','spend_A','spend_B','spend_C']]
===============================================================================
We have seen that, in the weight(kg) column, there is a negative value of -60. It is invalid value because weight cannot be negative. There is a high probability that weight is 60 kg and it is mistyped as -60. So, I will replace the negative value of -60 with positive value of 60.
We can do it as follows:-
df['weight(kg)'].replace(-60, 60, inplace=True)
Similarly, in the spend_B column, there is a negative value -100. The amount spent cannot be negative. So, we need to replace this negative value of -100 with positive value of 100.
We can do it as follows:-
df['spend_B'].replace(-100,100, inplace=True)
===============================================================================
In the height(cm) column, there is a value of 0.0. It is not possible as height cannot be 0. So, we need to resolve it. I will replace the 0.0 value with the mean of the height(cm) column. It can be done as follows:-
mean = df['height(cm)'].mean()
df['height(cm)'].replace(0.0, mean, inplace=True)
In the weight(kg) column, there is a very high absurd value of 160. It is not possible to have so much weight. Hence, it is invalid value. There is a high chance that the weight is 60 kg and it is mistakenly typed as 160 kg. So, I will replace the 160 data value with 60.
It can be done as follows:-
df['weight(kg)'].replace(160.0, 60.0, inplace=True)
===============================================================================
The following commands help us to deal with missing numerical values.
-
df.isnull()
The above command checks whether each cell in a dataframe contains missing values or not. If the cell contains missing value, it returns True otherwise it returns False.
-
df.isnull.sum()
The above command returns the total number of missing values in each column in the dataset.
-
isna() and notna() functions to detect NA values
Pandas provides isna() and notna() functions to detect ‘NA’ values. These are also methods on Series and DataFrame objects.
- Examples of isna() and notna() commands
df.isna()
pd.isna(df[‘col_name’])
df[‘col_name’].notna()
Missing values are encoded in different ways. They can appear as NaN
, NA
, ?
, zero 0
, ‘xx’, minus one -1
or a blank space “ ”. We need to use various pandas methods to deal with missing values. But, pandas always recognize missing values as NaN
. So, it is essential that we should first convert all the ?
, zeros 0
, xx
, minus ones -1
or blank spaces “ ” to NaN
. If the missing values isn’t identified as NaN
, then we have to first convert or replace such non NaN
entry with a NaN
.
df[df == '?'] = np.nan
There are several methods to handle missing values. Each method has its own advantages and disadvantages. The choice of the method is subjective and depends on the nature of data and the missing values. The summary of the options available for handling missing values is given below:-
• Drop missing values with dropna() method
• Fill missing values with a test statistic
I have discussed each method in below sections:-
This is the easiest method to handle missing values. In this method, we drop labels or columns from a data set which refer to missing values.
We can drop labels or rows from a dataset containing missing values as follows:-
df.dropna (axis = 0)
We can drop columns from a dataset containing missing values as follows:-
df.dropna (axis = 1)
This is the Pandas dataframe dropna() method. An equivalent dropna() method is available for Series with same functionality.
To drop a specific column from the dataframe, we can use drop() method of Pandas dataframe.
We can drop specific col_name column from Pandas dataframe as follows:-
df.drop(‘col_name’, axis = 1)
A note about axis parameter
Axis value may contain (0 or ‘index’) or (1 or ‘columns’). Its default value is 0.
We set axis = 0 or ‘index’ to drop rows which contain missing values.
We set axis = 1 or ‘columns’ to drop columns which contain missing values.
After dropping the missing values, we can again check for missing values and the dimensions of the dataframe.
We can again check the missing values in each column as follows:-
df.isnull.sum()
We can again check the dimensions of the dataset as follows:-
df.shape
But, this method has one disadvantage. It involves the risk of losing useful information. Suppose there are lots of missing values in our dataset. If drop them, we may end up throwing away valuable information along with the missing data. It is a very grave mistake as it involves losing key information. So, it is only advised when there are only few missing values in our dataset.
So, it's better to develop an imputation strategy so that we can impute missing values with the mean or the median of the row or column containing the missing values.
In this method, we fill the missing values with a test statistic like mean, median or mode of the particular feature the missing value belongs to. One can also specify a forward-fill or back-fill to propagate the next values backward or previous value forward.
We can fill missing values with a test statistic like mean as follows:-
mean = df['col_name'].mean()
df['col_name'].fillna(value = mean, inplace = True )
df[‘col_name’].replace(to_replace = NaN, value = mean, inplace = True)
If we choose this method, then we should compute the mean value on the training set and use it to fill the missing values in the training set. Then we should save the mean value that we have computed. Later, we will replace missing values in the test set with the mean value to evaluate the system.
===============================================================================
Finally, we can check for missing values programmatically. If we drop or fill missing values, we expect no missing values. We can write an assert statement to verify this. So, we can use an assert statement to programmatically check that no missing or unexpected '0' value is present. This gives confidence that our code is running properly.
Assert statement will return nothing if the value being tested is true and will throw an AssertionError if the value is false.
Asserts
• assert 1 == 1 (return Nothing if the value is True)
• assert 1 == 2 (return AssertionError if the value is False)
assert pd.notnull(df).all().all()
assert df.column_name.notnull().all()
assert (df >=0).all().all()
assert (df['column_name']!=0).all().all()
===============================================================================
When we take a closer look at the dataframe, we can see that our dataframe is not in the tidy data format.
The columns spend_A, spend_B and spend_C contain values of amount spent rather than variables. We should reorganize our dataframe into tidy data format.
We can convert it into the tidy data format using the pd.melt() function as follows:-
pd.melt(frame=df, id_vars=['fname','lname','age','sex','section','height(cm)','weight(kg)'],
`value_vars=['spend_A','spend_B','spend_C'], var_name='expenditure',`
`value_name='amount')`
===============================================================================
In this project, I present useful techniques to clean a messy dataset with Python and Pandas.
I start with the data cleaning process and emphasize the importance of a clean dataset. Then, I discuss tidy data format and signs of a messy dataset.
Then, I diagnose the dataset with exploratory data analysis (EDA) and visual exploratory data analysis (Visual EDA). I summarize my findings of EDA and Visual EDA.
Then, I start the data cleaning process. First of all, I split the ‘age_sex’ column into two separate columns. Then, I reorder the column labels.
Then, I discuss several ways to deal with negative numerical values, outliers and missing numerical values. Then, I programmatically check with ASSERT statement that there are no missing or negative values in the dataset.
In the last step, I reshape the data into tidy data format using 'pd.melt()' function.
===============================================================================
The ideas and techniques in this project have been taken from the following books and websites:-
i. Pandas documentation
ii. Python Machine Learning by Sebastian Raschka
iii. Hands-On Machine Learning with Scikit Learn and Tensorflow by Aurélien Géron
iv. http://pandas.pydata.org/pandas-docs/stable/missing_data.html
v. https://www.datacamp.com/courses/cleaning-data-in-python
vi. https://realpython.com/python-data-cleaning-numpy-pandas/
vii. https://data-flair.training/blogs/python-data-cleansing/