Dealing with Categorical Variables - Lab

Introduction

In this lab, you'll explore the Boston Housing Data Set for categorical variables, and you'll transform your data so you'll be able to use categorical data as predictors!

Objectives

You will be able to:

  • Identify and inspect the categorical variables in the Boston housing data set
  • Learn how to categorize inputs that aren't categorical
  • Create new datasets with dummy variables

Importing the Boston Housing data set

Let's start by importing the Boston Housing data set. This data set is available in Scikit-Learn, and can be imported running the column below.

import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()

If you'll inspect Boston now, you'll see that this basically returns a dictionary. Let's have a look at what exactly is stored in the dictionary by looking at the dictionary keys

# inspect boston
# look at the keys

Let's create a Pandas DataFrame with the data (which are the features, not including the target) and the feature names as column names.

boston_features = None
#inspect the first few rows

For your reference, we copied the attribute information below. Additional information can be found here: http://scikit-learn.org/stable/datasets/index.html#boston-dataset

  • CRIM: per capita crime rate by town
  • ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS: proportion of non-retail business acres per town
  • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX: nitric oxides concentration (parts per 10 million)
  • RM: average number of rooms per dwelling
  • AGE: proportion of owner-occupied units built prior to 1940
  • DIS: weighted distances to five Boston employment centres
  • RAD: index of accessibility to radial highways
  • TAX: full-value property-tax rate per $10,000
  • PTRATIO: pupil-teacher ratio by town
  • B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • LSTAT: % lower status of the population

Let's convert the target to a dataframe as well, and assign the column name "MEDV"

boston_target = None

#inspect the first few rows

The target is described as:

  • MEDV: Median value of owner-occupied homes in $1000’s

Next, let's merge the target and the predictors in one dataframe boston_df.

boston_df = None
boston_df.head()

Let's inspect these 13 features using .describe() and .info()

# code here
# code here

Now, take a look at the scatter plots for each predictor with the target on the y-axis.

import pandas as pd
import matplotlib.pyplot as plt

# create scatter plots

To categorical: binning

If you created your scatterplots correctly, you'll notice that except for CHAS (the Charles River Dummy variable), there is no clearly categorical data. You will have seen though that RAD and TAX have more of a vertical-looking structure like the one seen in the lesson, and that there is less of a "cloud"-looking structure compared to most other variables. It is difficult to justify a linear pattern between predictor and target here. In this situation, it might make sense to restructure data into bins so that they're treated as categorical variables. We'll start by showing how this can be done for RAD and then it's your turn to do this for TAX.

"RAD"

Look at the structure of "RAD" to decide how to create your bins.

boston_df["RAD"].describe()
# first, create bins for based on the values observed. 5 values will result in 4 bins
bins = [0, 3, 4 , 5, 24]
# use pd.cut
bins_rad = pd.cut(boston_df['RAD'], bins)
# using pd.cut returns unordered categories. Transform this to ordered categories.
bins_rad = bins_rad.cat.as_unordered()
bins_rad.head()
# inspect the result
bins_rad.value_counts().plot(kind='bar')
# replace the existing "RAD" column
boston_df["RAD"]=bins_rad

"TAX"

Split the "TAX" column up in 5 categories. You can chose the bins as desired but make sure they're pretty well-balanced.

# repeat everything for "TAX"

Perform label encoding

# perform label encoding and replace in boston_df
# inspect first few columns

Create dummy variables

Create dummy variables, and make sure their column names contain "TAX" and "RAD". Add the new dummy variables to boston_df and remove the old "RAD" and "TAX" columns.

# code goes here

Note how you end up with 21 columns now!

Summary

In this lab, you practiced your categorical variable knowledge on the Boston Housing Data Set!