For this lab, we will be using the dataset in the Customer Analysis Business Case. This dataset can be found in files_for_lab
folder. In this lab we will explore categorical data. You can also continue working on the same jupyter notebook from the previous lab. However that is not necessary.
- Import the necessary libraries if you are starting a new notebook.
- Load the csv. Use the variable
customer_df
ascustomer_df = pd.read_csv()
. - What should we do with the
customer_id
column? - Load the continuous and discrete variables into
numericals_df
andcategorical_df
variables, for eg.:numerical_df = customer_df.select_dtypes() categorical_df = customer_df.select_dtypes()
- Plot every categorical variable. What can you see in the plots? Note that in the previous lab you used a bar plot to plot categorical data, with each unique category in the column on the x-axis and an appropriate measure on the y-axis. However, this time you will try a different plot. This time in each plot for the categorical variable you will have, each unique category in the column on the x-axis and the target(which is numerical) on the Y-axis
- For the categorical data, check if there is any data cleaning that need to perform.
Hint: You can use the function
value_counts()
on each of the categorical columns and check the representation of different categories in each column. Discuss if this information might in some way be used for data cleaning.