In this lab, you'll explore interactions in the Ames Housing dataset.
You will be able to:
- Determine if an interaction term would be useful for a specific model or set of data
- Create interaction terms out of independent variables in linear regression
- Interpret coefficients of linear regression models that contain interaction terms
Once again we will be using the Ames Housing dataset, where each record represents a home sale:
# Run this cell without changes
import pandas as pd
ames = pd.read_csv('ames.csv', index_col=0)
# Remove some outliers to make the analysis more intuitive
ames = ames[ames["GrLivArea"] < 3000]
ames = ames[ames["LotArea"] < 20_000]
ames
In particular, we'll use these numeric and categorical features:
# Run this cell without changes
numeric = ['LotArea', '1stFlrSF', 'GrLivArea']
categorical = ['KitchenQual', 'Neighborhood']
Use all of the numeric and categorical features described above. (We will call this the "baseline" model because we are making a comparison with and without an interaction term. In a complete modeling process you would start with a simpler baseline.)
One-hot encode the categorical features (dropping the first), and center (subtract the mean) from the numeric features.
# Your code here - prepare data for modeling
Using the numeric and categorical features that you have prepared, as well as SalePrice
as the target, build a StatsModels OLS model.
# Your code here - import relevant libraries and build model
Describe the adjusted R-Squared as well as which coefficients are statistically significant. For now you can skip interpreting all of the coefficients.
# Your code here - evaluate the baseline model
# Your written answer here
Answer (click to reveal)
The model overall explains about 83% of the variance in sale price.
We'll used the standard alpha of 0.05 to evaluate statistical significance:
- Coefficients for the intercept as well as all continuous variables are statistically significant
- Coefficients for
KitchenQual
are statistically significant - Coefficients for most values of
Neighborhood
are statistically significant, while some are not. In this context the reference category wasBlmngtn
, which means that neighborhoods with statistically significant coefficients differ significantly fromBlmngtn
whereas neighborhoods with coefficients that are not statistically significant do not differ significantly fromBlmngtn
Square footage of a home is often worth different amounts depending on the neighborhood. So let's see if we can improve the model by building an interaction term between GrLivArea
and one of the Neighborhood
categories.
Because there are so many neighborhoods to consider, we'll narrow it down to 2 options: Neighborhood_OldTown
or Neighborhood_NoRidge
.
First, create a plot that has:
GrLivArea
on the x-axisSalePrice
on the y-axis- A scatter plot of homes in the
OldTown
andNoRidge
neighborhoods, identified by color- Hint: you will want to call
.scatter
twice, once for each neighborhood
- Hint: you will want to call
- A line showing the fit of
GrLivArea
vs.SalePrice
for the reference neighborhood
# Your code here - import plotting library and create visualization
Looking at this plot, do either of these neighborhoods seem to have a slope that differs notably from the best fit line? If so, this is an indicator that an interaction term might be useful.
Identify what, if any, interaction terms you would create based on this information.
# Your written answer here
Answer (click to reveal)
Your plot should look something like this:
If we drew the expected slopes based on the scatter plots, they would look something like this:
The slope of the orange line looks fairly different from the slope of the gray line, indicating that an interaction term for NoRidge
might be useful.
Let's also investigate to see whether adding an interaction term between two of the numeric features would be helpful.
We'll specifically focus on interactions with LotArea
. Does the value of an extra square foot of lot area change depending on the square footage of the home? Both 1stFlrSF
and GrLivArea
are related to home square footage, so we'll use those in our comparisons.
Create two side-by-side plots:
- One scatter plot of
LotArea
vs.SalePrice
where the color of the points is based on1stFlrSF
- One scatter plot of
LotArea
vs.SalePrice
where the color of the points is based onGrLivArea
# Your code here - create two visualizations
Looking at these plots, does the slope between LotArea
and SalePrice
seem to differ based on the color of the point? If it does, that is an indicator that an interaction term might be helpful.
Describe your interpretation below:
# Your written answer here
Answer (click to reveal)
Your plots should look something like this:
For both 1stFlrSF
and GrLivArea
, it seems like a larger lot area doesn't matter very much for homes with less square footage. (In other words, the slope is closer to a flat line when the dots are lighter colored.) Then for homes with more square footage, a larger lot area seems to matter more for the sale price. (In other words, the slope is steeper when the dots are darker colored.)
This difference in slope based on color indicates that an interaction term for either/both of 1stFlrSF
and GrLivArea
with LotArea
might be helpful.
For ease of model interpretation, it probably makes the most sense to create an interaction term between LotArea
and 1stFlrSF
, since we already have an interaction that uses GrLivArea
.
Based on your analysis above, build a model based on the baseline model with one or more interaction terms added.
# Your code here - build a model with one or more interaction terms
Same as with the baseline model, describe the adjusted R-Squared and statistical significance of the coefficients.
# Your code here - evaluate the model with interactions
# Your written answer here
Answer (click to reveal)
The model overall still explains about 83% of the variance in sale price. The baseline explained 82.7% whereas this model explains 82.9%, so it's a marginal improvement.
- Coefficients for the intercept as well as all continuous variables are still statistically significant
- Coefficients for
KitchenQual
are still statistically significant Neighborhood_NoRidge
used to be statistically significant but now it is notGrLivArea x Neighborhood_NoRidge
is not statistically significantLotArea x 1stFlrSF
is statistically significant
Interpret the coefficients for the intercept as well as the interactions and all variables used in the interactions. Make sure you only interpret the coefficients that were statistically significant!
# Your written answer here
Answer (click to reveal)
The intercept is about 258k. This means that a home with average continuous attributes and reference categorical attributes (excellent kitchen quality, Bloomington Heights neighborhood) would cost about \$258k.
The coefficient for LotArea
is about 2.58. This means that for a home with average first floor square footage, each additional square foot of lot area is associated with an increase of about \$2.58 in sale price.
The coefficient for 1stFlrSF
is about 30.5. This means that for a home with average lot area, each additional square foot of first floor area is associated with an increase of about \$30.50 in sale price.
The coefficient for LotArea x 1stFlrSF
is about 0.003. This means that:
- For each additional square foot of lot area, there is an increase of about \$2.58 + (0.003 x first floor square footage) in sale price
- For each additional square foot of first floor square footage, there is an increase of about \$30.50 + (0.003 x lot area square footage) in sale price
Neighborhood_NoRidge
and GrLivArea x Neighborhood_NoRidge
were not statistically significant so we won't be interpreting their coefficients.
You should now understand how to include interaction effects in your model! As you can see, interactions that seem promising may or may not end up being statistically significant. This is why exploration and iteration are important!