In this lab, you'll explore the Ames Housing dataset for categorical variables, and you'll transform your data so you'll be able to use categorical data as predictors!
You will be able to:
- Determine whether variables are categorical or continuous
- Use one hot encoding to create dummy variables
- Describe why dummy variables are necessary
Let's start by importing the Ames Housing dataset from ames.csv
into a pandas dataframe using pandas read_csv()
# Import your data
Now look at the first five rows of ames
:
# Inspect the first few rows
Look in data_description.txt
for a full description of all variables.
A preview of some of the columns:
LotArea: Size of the lot in square feet
MSZoning: Identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
OverallCond: Rates the overall condition of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
KitchenQual: Kitchen quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
YrSold: Year Sold (YYYY)
SalePrice: Sale price of the house in dollars
Let's inspect all features using .describe()
and .info()
# Use .describe()
# Use .info()
Now, pick 6 categorical variables and plot them against SalePrice with a bar graph for each variable. All 6 bar graphs should be on the same figure.
import matplotlib.pyplot as plt
%matplotlib inline
# Create bar plots
Create dummy variables for the six categorical features you chose remembering to drop the first. Drop the categorical columns that you used, concat the dummy columns to our continuous variables and asign it to a new variable ames_preprocessed
# Create dummy variables for your six categorical features
In this lab, you practiced your knowledge of categorical variables on the Ames Housing dataset! Specifically, you practiced distinguishing continuous and categorical data. You then created dummy variables using one hot encoding.