Preprocessing with scikit-learn - Cumulative Lab

Introduction

In this cumulative lab, you'll practice applying various preprocessing techniques with scikit-learn (sklearn) to the Ames Housing dataset in order to prepare the data for predictive modeling. The main emphasis here is on preprocessing (not EDA or modeling theory), so we will skip over most of the visualization and metrics steps that you would take in an actual modeling process.

Objectives

You will be able to:

  • Practice identifying which preprocessing technique to use
  • Practice filtering down to relevant columns
  • Practice applying sklearn.impute to fill in missing values
  • Practice applying sklearn.preprocessing:
    • OrdinalEncoder for converting binary categories to 0 and 1 within a single column
    • OneHotEncoder for creating multiple "dummy" columns to represent multiple categories

Your Task: Prepare the Ames Housing Dataset for Modeling

[Image: a house in Ames. Photo by Kyle Kempt on Unsplash]

Requirements

1. Drop Irrelevant Columns

For the purposes of this lab, we will only be using a subset of all of the features present in the Ames Housing dataset. In this step you will drop all irrelevant columns.

2. Handle Missing Values

Often for reasons outside of a data scientist's control, datasets are missing some values. In this step you will assess the presence of NaN values in our subset of data, and use MissingIndicator and SimpleImputer from the sklearn.impute submodule to handle any missing values.

3. Convert Categorical Features into Numbers

A built-in assumption of the scikit-learn library is that all data being fed into a machine learning model is already in a numeric format; otherwise you will get a ValueError when you try to fit a model. In this step you will use an OrdinalEncoder to replace the values within individual non-numeric columns with 0s and 1s, and a OneHotEncoder to replace columns containing more than 2 categories with multiple "dummy" columns of 0s and 1s.

At this point, a scikit-learn model should be able to run without errors!
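
To make this concrete, here is a tiny standalone illustration of both encoders. The toy dataframe below is invented purely for demonstration and is not one of the lab's cells:

# Standalone illustration only; not one of the lab's cells
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

toy = pd.DataFrame({
    "street": ["Pave", "Grvl", "Pave"],   # 2 categories
    "quality": ["Gd", "TA", "Ex"]         # more than 2 categories
})

# Binary column -> a single column of 0s and 1s (Grvl=0, Pave=1)
print(OrdinalEncoder().fit_transform(toy[["street"]]))

# Multi-category column -> one "dummy" column per category (Ex, Gd, TA)
print(OneHotEncoder().fit_transform(toy[["quality"]]).toarray())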

4. Preprocess Test Data

Apply Steps 1-3 to the test data in order to perform a final model evaluation.

Lab Setup

Getting the Data

In the cell below, we import the pandas library, open the CSV containing the Ames Housing data as a pandas DataFrame, and inspect its contents.

# Run this cell without changes
import pandas as pd
df = pd.read_csv("data/ames.csv")
df
# Run this cell without changes
df.describe()

The prediction target for this analysis is the sale price of the home, so we separate the data into X and y accordingly:

# Run this cell without changes
y = df["SalePrice"]
X = df.drop("SalePrice", axis=1)

Next, we separate the data into a train set and a test set prior to performing any preprocessing steps:

# Run this cell without changes
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

(If you are working through this lab and you just want to start over with the original value for X_train, re-run the cell above.)

# Run this cell without changes
print(f"X_train is a DataFrame with {X_train.shape[0]} rows and {X_train.shape[1]} columns")
print(f"y_train is a Series with {y_train.shape[0]} values")

# We should always have the same number of rows in X as values in y
assert X_train.shape[0] == y_train.shape[0]

Fitting a Model

For this lab we will be using a LinearRegression model from scikit-learn (documentation here).

Right now, we have not done any preprocessing, so we expect that trying to fit a model will fail:

# Run this cell without changes
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

As you can see, we got ValueError: could not convert string to float: 'RL'.

In order to fit a scikit-learn model, all values must be numeric, and the third column of our full dataset (MSZoning) contains values like 'RL' and 'RH', which are strings. So this error was expected, but after some preprocessing, this model will work!

1. Drop Irrelevant Columns

For the purpose of this analysis, we'll only use the following columns, described by relevant_columns. You can find the full description of their values in the file data/data_description.txt included in this repository.

In the cell below, reassign X_train so that it only contains the columns in relevant_columns.

Hint: Even though we describe this as "dropping" irrelevant columns, it's easier if you invert the logic, so that we are only keeping relevant columns, rather than using the .drop() method. It is possible to use the .drop() method if you really want to, but first you would need to create a list of the column names that you don't want to keep.

# Replace None with appropriate code

# Declare relevant columns
relevant_columns = [
    'LotFrontage',  # Linear feet of street connected to property
    'LotArea',      # Lot size in square feet
    'Street',       # Type of road access to property
    'OverallQual',  # Rates the overall material and finish of the house
    'OverallCond',  # Rates the overall condition of the house
    'YearBuilt',    # Original construction date
    'YearRemodAdd', # Remodel date (same as construction date if no remodeling or additions)
    'GrLivArea',    # Above grade (ground) living area square feet
    'FullBath',     # Full bathrooms above grade
    'BedroomAbvGr', # Bedrooms above grade (does NOT include basement bedrooms)
    'TotRmsAbvGrd', # Total rooms above grade (does not include bathrooms)
    'Fireplaces',   # Number of fireplaces
    'FireplaceQu',  # Fireplace quality
    'MoSold',       # Month Sold (MM)
    'YrSold'        # Year Sold (YYYY)
]

# Reassign X_train so that it only contains relevant columns
None

# Visually inspect X_train
X_train
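
If you're stuck, here is one possible way to fill in the reassignment above, following the hint's "keep relevant columns" logic (X_train[relevant_columns] would work just as well):

# Possible solution (for reference)
X_train = X_train.loc[:, relevant_columns]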

Check that the new shape is correct:

# Run this cell without changes

# X_train should have the same number of rows as before
assert X_train.shape[0] == 1095

# Now X_train should only have as many columns as relevant_columns
assert X_train.shape[1] == len(relevant_columns)

2. Handle Missing Values

In the cell below, we check to see if there are any NaNs in the selected subset of data:

# Run this cell without changes
X_train.isna().sum()

Ok, it looks like we have some NaNs in LotFrontage and FireplaceQu.

Before we proceed to fill in those values, we need to ask: do these NaNs actually represent missing values, or is there some real value/category being represented by NaN?

Fireplace Quality

To start with, let's look at FireplaceQu, which means "Fireplace Quality". Why might we have NaN fireplace quality?

Well, some properties don't have fireplaces!

Let's confirm this guess with a little more analysis.

First, we know that there are 512 records with NaN fireplace quality. How many records are there with zero fireplaces?

# Run this cell without changes
X_train[X_train["Fireplaces"] == 0]

Ok, that's 512 rows, same as the number of NaN FireplaceQu records. To double-check, let's query for that combination of factors (zero fireplaces and FireplaceQu is NaN):

# Run this cell without changes
X_train[
    (X_train["Fireplaces"] == 0) &
    (X_train["FireplaceQu"].isna())
]

Looks good, still 512 records. So, NaN fireplace quality is not actually information that is missing from our dataset; it is a genuine category meaning "fireplace quality is not applicable". This interpretation aligns with what we see in data/data_description.txt:

...
FireplaceQu: Fireplace quality

       Ex	Excellent - Exceptional Masonry Fireplace
       Gd	Good - Masonry Fireplace in main level
       TA	Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
       Fa	Fair - Prefabricated Fireplace in basement
       Po	Poor - Ben Franklin Stove
       NA	No Fireplace
...

So, let's replace those NaNs with the string "N/A" to indicate that this is a real category, not missing data:

# Run this cell without changes
X_train["FireplaceQu"] = X_train["FireplaceQu"].fillna("N/A")
X_train["FireplaceQu"].value_counts()

Eventually we will still need to perform some preprocessing to prepare the FireplaceQu column for modeling (because models require numeric inputs rather than inputs of type object), but we don't need to worry about filling in missing values.

Lot Frontage

Now let's look at LotFrontage. Here it is less clear-cut: NaN might be a genuine category, or it might simply be missing data. Let's apply some domain understanding to figure out whether lot frontage can be "not applicable" in the same way that fireplace quality can.

Lot frontage is defined as the "Linear feet of street connected to property", i.e. how much of the property runs directly along a road. The amount of frontage required for a property depends on its zoning. Let's look at the zoning of all records with NaN for LotFrontage:

# Run this cell without changes
df[df["LotFrontage"].isna()]["MSZoning"].value_counts()

So, we have RL (residential low density), RM (residential medium density), FV (floating village residential), and RH (residential high density). Looking at the building codes from the City of Ames, it appears that all of these zones require at least 24 feet of frontage.

Nevertheless, we can't assume that all properties have frontage just because the zoning regulations require it. Maybe these properties predate the regulations, or they received some kind of variance permitting them to get around the requirement. It's still not as clear here as it was with the fireplaces whether this is a genuine "not applicable" kind of NaN or a "missing information" kind of NaN.

In a case like this, we can take a two-pronged approach:

  1. Make a new column in the dataset that simply represents whether LotFrontage was originally NaN
  2. Fill in the NaN values of LotFrontage with median frontage in preparation for modeling

Missing Indicator for LotFrontage

First, we import sklearn.impute.MissingIndicator (documentation here). The goal of using a MissingIndicator is to create a new column that records which values were NaN (or some other "missing" value) in the original dataset, in case NaN turns out to be a meaningful signal rather than randomly missing data.

A MissingIndicator is a scikit-learn transformer, meaning that we will use the standard steps for any scikit-learn transformer:

  1. Identify data to be transformed (typically not every column is passed to every transformer)
  2. Instantiate the transformer object
  3. Fit the transformer object (on training data only)
  4. Transform data using the transformer object
  5. Add the transformed data to the other data that was not transformed
# Replace None with appropriate code
from sklearn.impute import MissingIndicator

# (1) Identify data to be transformed
# We only want missing indicators for LotFrontage
frontage_train = X_train[["LotFrontage"]]

# (2) Instantiate the transformer object
missing_indicator = MissingIndicator()

# (3) Fit the transformer object on frontage_train
None

# (4) Transform frontage_train and assign the result
# to frontage_missing_train
frontage_missing_train = None

# Visually inspect frontage_missing_train
frontage_missing_train
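
If you get stuck, both blanks follow the standard fit/transform pattern described above; one possible completion:

# Possible solution (for reference)
# (3) Fit the transformer object on frontage_train
missing_indicator.fit(frontage_train)

# (4) Transform frontage_train
frontage_missing_train = missing_indicator.transform(frontage_train)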

The result of transforming frontage_train should be a two-dimensional NumPy array where each row holds a single True or False value. Make sure the asserts pass before moving on to the next step.

# Run this cell without changes
import numpy as np

# frontage_missing_train should be a NumPy array
assert type(frontage_missing_train) == np.ndarray

# We should have the same number of rows as the full X_train
assert frontage_missing_train.shape[0] == X_train.shape[0]

# But we should only have 1 column
assert frontage_missing_train.shape[1] == 1

Now let's add this new information as a new column of X_train:

# Run this cell without changes

# (5) add the transformed data to the other data
X_train["LotFrontage_Missing"] = frontage_missing_train
X_train
# Run this cell without changes

# Now we should have 1 extra column compared to
# our original subset
assert X_train.shape[1] == len(relevant_columns) + 1

Imputing Missing Values for LotFrontage

Now that we have noted where missing values were originally present, let's use a SimpleImputer (documentation here) to fill in those NaNs in the LotFrontage column.

The process is very similar to the MissingIndicator process, except that we want to replace the original LotFrontage column with the transformed version instead of just adding a new column on.

In the cell below, create and use a SimpleImputer with strategy="median" to transform the value of frontage_train (declared above).

# Replace None with appropriate code

from sklearn.impute import SimpleImputer

# (1) frontage_train was created previously, so we don't
# need to extract the relevant data again

# (2) Instantiate a SimpleImputer with strategy="median"
imputer = None

# (3) Fit the imputer on frontage_train
None

# (4) Transform frontage_train using the imputer and
# assign the result to frontage_imputed_train
frontage_imputed_train = None

# Visually inspect frontage_imputed_train
frontage_imputed_train
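
Again, the blanks follow the standard pattern; one possible completion:

# Possible solution (for reference)
# (2) Instantiate a SimpleImputer with strategy="median"
imputer = SimpleImputer(strategy="median")

# (3) Fit the imputer on frontage_train
imputer.fit(frontage_train)

# (4) Transform frontage_train using the imputer
frontage_imputed_train = imputer.transform(frontage_train)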

Now we can replace the original value of LotFrontage in X_train with the new value:

# Run this cell without changes

# (5) Replace value of LotFrontage
X_train["LotFrontage"] = frontage_imputed_train

# Visually inspect X_train
X_train

Now the shape of X_train should still be the same as before:

# Run this cell without changes
assert X_train.shape == (1095, 16)

And now our X_train no longer contains any NaN values:

# Run this cell without changes
X_train.isna().sum()

Great! Now we have completed Step 2.

3. Convert Categorical Features into Numbers

Even though we have dropped irrelevant columns and filled in those NaN values, if we feed the current X_train into our scikit-learn LinearRegression model, it will still crash:

# Run this cell without changes
model.fit(X_train, y_train)

Now the first column to cause a problem is Street, which is documented like this:

...
Street: Type of road access to property

       Grvl	Gravel	
       Pave	Paved
...

Let's look at the full X_train:

# Run this cell without changes
X_train.info()

So, our model is crashing because some of the columns are non-numeric.

Anything that is already float64 or int64 will work with our model, but these features need to be converted:

  • Street (currently type object)
  • FireplaceQu (currently type object)
  • LotFrontage_Missing (currently type bool)

There are two main approaches to converting these values, depending on whether there are 2 values (meaning the categorical variable can be converted into a single binary number) or more than 2 values (meaning we need to create extra columns to represent all categories).

(If there is only 1 value, this is not a useful feature for the purposes of predictive analysis because every single row contains the same information.)

In the cell below, we inspect the value counts of the specified features:

# Run this cell without changes

print(X_train["Street"].value_counts())
print()
print(X_train["FireplaceQu"].value_counts())
print()
print(X_train["LotFrontage_Missing"].value_counts())

So, it looks like Street and LotFrontage_Missing have only 2 categories and can be converted into binary in place, whereas FireplaceQu has 6 categories and will need to be expanded into multiple columns.

Binary Categories

For binary categories, we will use an OrdinalEncoder (documentation here) to convert the categories of Street and LotFrontage_Missing into binary values (0s and 1s).

Just like in Step 2 when we used the MissingIndicator and SimpleImputer, we will follow these steps:

  1. Identify data to be transformed
  2. Instantiate the transformer object
  3. Fit the transformer object (on training data only)
  4. Transform data using the transformer object
  5. Add the transformed data to the other data that was not transformed

Let's start with transforming Street:

# Replace None with appropriate code

# (0) import OrdinalEncoder from sklearn.preprocessing
None

# (1) Create a variable street_train that contains the
# relevant column from X_train
# (Use double brackets [[]] to get the appropriate shape)
street_train = None

# (2) Instantiate an OrdinalEncoder
encoder_street = None

# (3) Fit the encoder on street_train
None

# Inspect the categories of the fitted encoder
encoder_street.categories_[0]
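
In case you need it, one possible completion of the cell above:

# Possible solution (for reference)
# (0) Import
from sklearn.preprocessing import OrdinalEncoder

# (1) Extract the relevant column; double brackets keep it 2-dimensional
street_train = X_train[["Street"]]

# (2) Instantiate
encoder_street = OrdinalEncoder()

# (3) Fit on the training data
encoder_street.fit(street_train)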

The .categories_ attribute of OrdinalEncoder is only present once the .fit method has been called; the trailing underscore is the scikit-learn convention for attributes that are learned during fitting.

What this tells us is that when encoder_street is used to transform the street data into 1s and 0s, 0 will mean 'Grvl' (gravel) in the original data, and 1 will mean 'Pave' (paved) in the original data.

The eventual scikit-learn model only cares about the 1s and 0s, but this information can be useful for us to understand what our code is doing and help us debug when things go wrong.

Now let's transform street_train with the fitted encoder:

# Replace None with appropriate code

# (4) Transform street_train using the encoder and
# assign the result to street_encoded_train
street_encoded_train = None

# Flatten for appropriate shape
street_encoded_train = street_encoded_train.flatten()

# Visually inspect street_encoded_train
street_encoded_train
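
The blank in step (4) is a single transform call. Note that .transform returns a 2-dimensional NumPy array of floats with shape (n_rows, 1), which is why the .flatten() call is needed before the result can be assigned back to a single dataframe column:

# Possible solution (for reference)
street_encoded_train = encoder_street.transform(street_train)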

All of the values we see appear to be 1 right now, but that makes sense since there were only 4 properties with gravel (0) streets in X_train.

Now let's replace the original Street column with the encoded version:

# Replace None with appropriate code

# (5) Replace value of Street
X_train["Street"] = None

# Visually inspect X_train
X_train
# Run this cell without changes
X_train.info()

Perfect! Now Street should be type float64 instead of object, since OrdinalEncoder outputs floats by default.

Now, repeat the same process with LotFrontage_Missing:

# Replace None with appropriate code

# (1) We already have a variable frontage_missing_train
# from earlier, no additional step needed

# (2) Instantiate an OrdinalEncoder for missing frontage
encoder_frontage_missing = None

# (3) Fit the encoder on frontage_missing_train
None

# Inspect the categories of the fitted encoder
encoder_frontage_missing.categories_[0]
# Replace None with appropriate code

# (4) Transform frontage_missing_train using the encoder and
# assign the result to frontage_missing_encoded_train
frontage_missing_encoded_train = None

# Flatten for appropriate shape
frontage_missing_encoded_train = frontage_missing_encoded_train.flatten()

# Visually inspect frontage_missing_encoded_train
frontage_missing_encoded_train
# Replace None with appropriate code

# (5) Replace value of LotFrontage_Missing
X_train["LotFrontage_Missing"] = None

# Visually inspect X_train
X_train
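
One possible completion of the blanks in the three cells above, reusing the exact same OrdinalEncoder pattern (the .flatten() step shown above still happens between (4) and (5)):

# Possible solution (for reference)
# (2) Instantiate an OrdinalEncoder for missing frontage
encoder_frontage_missing = OrdinalEncoder()

# (3) Fit the encoder on frontage_missing_train
encoder_frontage_missing.fit(frontage_missing_train)

# (4) Transform (False -> 0.0, True -> 1.0)
frontage_missing_encoded_train = encoder_frontage_missing.transform(frontage_missing_train)

# (5) Replace the column with the flattened result
X_train["LotFrontage_Missing"] = frontage_missing_encoded_train
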
# Run this cell without changes
X_train.info()

Great, now we only have 1 column remaining that isn't type float64 or int64!

Note on Preprocessing Boolean Values

For binary values like LotFrontage_Missing, you might see a few different approaches to preprocessing. Python treats True and 1 as equal:

# Run this cell without changes
print(True == 1)
print(False == 0)

This means that if your model is implemented purely in Python, you might be able to leave columns as type bool without any issues, and you will likely see examples that do this. However, if your model relies on C or Java "under the hood", bool columns might cause problems.

There is also a technique using pandas rather than scikit-learn for this particular conversion of boolean values to integers:

# Run this cell without changes
df_example = pd.DataFrame(frontage_missing_train, columns=["LotFrontage_Missing"])
df_example
# Run this cell without changes
df_example["LotFrontage_Missing"] = df_example["LotFrontage_Missing"].astype(int)
df_example

This code is casting every value in the LotFrontage_Missing column to an integer, achieving the same result as the OrdinalEncoder example with less code.

The downside of using this approach is that it doesn't fit into a scikit-learn pipeline as neatly because it is using pandas to do the transformation instead of scikit-learn.

In the future, you will need to make your own determination of which strategy to use!

Multiple Categories

Unlike Street and LotFrontage_Missing, FireplaceQu has more than two categories. Therefore the process for encoding it numerically is a bit more complicated, because we will need to create multiple "dummy" columns, each representing one category.

To do this, we can use a OneHotEncoder from sklearn.preprocessing (documentation here).

The first several steps are very similar to all of the other transformers we've used so far, although the process of combining the data with the original data differs.

In the cells below, complete steps (0)-(4) of preprocessing the FireplaceQu column using a OneHotEncoder:

# Replace None with appropriate code

# (0) import OneHotEncoder from sklearn.preprocessing
None

# (1) Create a variable fireplace_qu_train
# extracted from X_train
# (double brackets due to shape expected by OHE)
fireplace_qu_train = X_train[["FireplaceQu"]]

# (2) Instantiate a OneHotEncoder with categories="auto",
# sparse=False, and handle_unknown="ignore"
ohe = None

# (3) Fit the encoder on fireplace_qu_train
None

# Inspect the categories of the fitted encoder
ohe.categories_
# Replace None with appropriate code

# (4) Transform fireplace_qu_train using the encoder and
# assign the result to fireplace_qu_encoded_train
fireplace_qu_encoded_train = None

# Visually inspect fireplace_qu_encoded_train
fireplace_qu_encoded_train
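
One possible completion of the blanks above. One version caveat: the sparse parameter was renamed to sparse_output in scikit-learn 1.2, and sparse was removed entirely in 1.4, so use whichever name your installed version supports:

# Possible solution (for reference)
# (0) Import
from sklearn.preprocessing import OneHotEncoder

# (2) Instantiate; on scikit-learn 1.2+ use sparse_output=False instead
ohe = OneHotEncoder(categories="auto", sparse=False, handle_unknown="ignore")

# (3) Fit on the training data
ohe.fit(fireplace_qu_train)

# (4) Transform; with sparse=False the result is a dense NumPy array
fireplace_qu_encoded_train = ohe.transform(fireplace_qu_train)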

Notice that this time, unlike with MissingIndicator, SimpleImputer, or OrdinalEncoder, we have created multiple columns of data out of a single column. The code below converts this unlabeled NumPy array into a readable pandas dataframe in preparation for merging it back with the rest of X_train:

# Run this cell without changes

# (5a) Make the transformed data into a dataframe
fireplace_qu_encoded_train = pd.DataFrame(
    # Pass in NumPy array
    fireplace_qu_encoded_train,
    # Set the column names to the categories found by OHE
    columns=ohe.categories_[0],
    # Set the index to match X_train's index
    index=X_train.index
)

# Visually inspect new dataframe
fireplace_qu_encoded_train

A couple notes on the code above:

  • The main goal of converting this into a dataframe (rather than converting X_train into a NumPy array, which would also allow them to be combined) is readability — to help you and others understand what your code is doing, and to help you debug. Eventually when you write this code as a pipeline, it will be NumPy arrays "under the hood".
  • We are using just the raw categories from FireplaceQu as our new dataframe columns, but you'll also see examples where a lambda function or list comprehension is used to create column names indicating the original column name, e.g. FireplaceQu_Ex, FireplaceQu_Fa rather than just Ex, Fa (see the sketch just after this list). This is a design decision based on readability; the scikit-learn model will work the same either way.
  • It is very important that the index of the new dataframe matches the index of the main X_train dataframe. Because we used train_test_split, the index of X_train is shuffled, so it goes 1023, 810, 1384 etc. instead of 0, 1, 2, etc. If you don't specify an index for the new dataframe, it will assign the first record to the index 0 rather than 1023. If you are ever merging encoded data like this and a bunch of NaNs start appearing, make sure that the indexes are lined up correctly! You also may see examples where the index of X_train has been reset, rather than specifying the index of the new dataframe — either way works.
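
For example, the list comprehension approach mentioned above could be used to build more verbose labels. The labeled_columns name below is just for illustration; you would pass it as the columns argument in cell (5a):

# Illustrative sketch: prefix each category with the original column name
labeled_columns = [f"FireplaceQu_{cat}" for cat in ohe.categories_[0]]
# Produces FireplaceQu_Ex, FireplaceQu_Fa, ... instead of Ex, Fa, ...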

Next, we want to drop the original FireplaceQu column containing the categorical data:

(For previous transformations we didn't need to drop anything because we were replacing 1 column with 1 new column in place, but one-hot encoding works differently.)

# Run this cell without changes

# (5b) Drop original FireplaceQu column
X_train.drop("FireplaceQu", axis=1, inplace=True)

# Visually inspect X_train
X_train

Finally, we want to concatenate the new dataframe together with the original X_train:

# Run this cell without changes

# (5c) Concatenate the new dataframe with current X_train
X_train = pd.concat([X_train, fireplace_qu_encoded_train], axis=1)

# Visually inspect X_train
X_train
# Run this cell without changes
X_train.info()

Ok, everything is numeric now! We have completed the minimum necessary preprocessing to use these features in a scikit-learn model!

# Run this cell without changes
model.fit(X_train, y_train)

Great, no error this time.

Let's use cross validation to take a look at the model's performance:

# Run this cell without changes
from sklearn.model_selection import cross_val_score

cross_val_score(model, X_train, y_train, cv=3)

Not terrible: the default scoring metric here is R², so we are explaining between 66% and 80% of the variance in the target with our current feature set, depending on the fold. Let's say that this is our final model and move on to preparing the test data.

4. Preprocess Test Data

Apply Steps 1-3 to the test data in order to perform a final model evaluation.

This part is done for you, and it should work automatically, assuming you didn't change the names of any of the transformer objects. Note that we are intentionally not instantiating or fitting the transformers here, because you always want to fit transformers on the training data only.

Step 1: Drop Irrelevant Columns

# Run this cell without changes
X_test = X_test.loc[:, relevant_columns]

Step 2: Handle Missing Values

# Run this cell without changes

# Replace FireplaceQu NaNs with "N/A"s
X_test["FireplaceQu"] = X_test["FireplaceQu"].fillna("N/A")

# Add missing indicator for lot frontage
frontage_test = X_test[["LotFrontage"]]
frontage_missing_test = missing_indicator.transform(frontage_test)
X_test["LotFrontage_Missing"] = frontage_missing_test

# Impute missing lot frontage values
frontage_imputed_test = imputer.transform(frontage_test)
X_test["LotFrontage"] = frontage_imputed_test

# Check that there are no more missing values
X_test.isna().sum()

Step 3: Convert Categorical Features into Numbers

# Run this cell without changes

# Encode street type
street_test = X_test[["Street"]]
street_encoded_test = encoder_street.transform(street_test).flatten()
X_test["Street"] = street_encoded_test

# Encode frontage missing
frontage_missing_test = X_test[["LotFrontage_Missing"]]
frontage_missing_encoded_test = encoder_frontage_missing.transform(frontage_missing_test).flatten()
X_test["LotFrontage_Missing"] = frontage_missing_encoded_test

# One-hot encode fireplace quality
fireplace_qu_test = X_test[["FireplaceQu"]]
fireplace_qu_encoded_test = ohe.transform(fireplace_qu_test)
fireplace_qu_encoded_test = pd.DataFrame(
    fireplace_qu_encoded_test,
    columns=ohe.categories_[0],
    index=X_test.index
)
X_test.drop("FireplaceQu", axis=1, inplace=True)
X_test = pd.concat([X_test, fireplace_qu_encoded_test], axis=1)

# Visually inspect X_test
X_test

Fit the model on the full training set, evaluate on test set:

# Run this cell without changes
model.fit(X_train, y_train)
model.score(X_test, y_test)

Great, that worked! Now we have completed the full process of preprocessing the Ames Housing data in preparation for machine learning!

Summary

In this cumulative lab, you used various techniques to prepare the Ames Housing data for modeling. You filtered down the full dataset to only relevant columns, filled in missing values, and converted categorical data into numeric data. Each time, you practiced the scikit-learn transformer workflow by instantiating the transformer, fitting on the relevant training data, transforming the training data, and transforming the test data at the end (without re-instantiating or re-fitting the transformer object).