In this lab, you'll practice adding polynomial terms to your regression model!
You will be able to:
- Determine if polynomial regression would be useful for a specific model or set of data
- Create polynomial terms out of independent variables in linear regression
For this lab you'll be using some generated data:
# Run this cell without changes
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('sample_data.csv')
df.head()
Let's check out a scatter plot of x vs. y:
# Run this cell without changes
df.plot.scatter(x="x", y="y");
Notice that the data is clearly non-linear. Start thinking about which polynomial degree you believe will fit it best.
You will fit several different models with different polynomial degrees, then plot them in the same plot at the end.
# Your code here - import StatsModels and separate the data into X and y
The first model is quadratic: it should include a constant, x, and x squared. You can use pandas or PolynomialFeatures to create the squared term.
# Your code here - prepare quadratic data and fit a model
# Your code here - evaluate (adjusted) R-Squared and coefficient p-values
# Your written answer here - summarize findings
Answer (click to reveal)
This is not a good model. Because we have multiple terms and are explaining so little of the variance in y, we actually have a negative adjusted R-Squared.
None of the coefficients are statistically significant at an alpha of 0.05.
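To see how an adjusted R-Squared can be negative, it may help to write out the standard definition. Here $n$ is the number of observations and $p$ is the number of predictors excluding the constant (these symbols are introduced for this note and do not appear elsewhere in the lab):

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
\qquad\Longrightarrow\qquad
\bar{R}^2 < 0 \iff R^2 < \frac{p}{n - 1}
```

So when the plain R-Squared is small relative to the number of terms in the model, the penalty for those terms pushes the adjusted value below zero.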
Next, build a 4th degree polynomial model. In other words, the model should include a constant and x raised to the powers 1 through 4.
At this point we recommend importing and using PolynomialFeatures if you haven't already!
# Your code here - prepare 4th degree polynomial data and fit a model
# Your code here - evaluate (adjusted) R-Squared and coefficient p-values
# Your written answer here - summarize findings
Answer (click to reveal)
This is much better. We are explaining 57-58% of the variance in the target and all of our coefficients are statistically significant at an alpha of 0.05.
This model should include a constant and x raised to the powers 1 through 8.
# Your code here - prepare 8th degree polynomial data and fit a model
# Your code here - evaluate (adjusted) R-Squared and coefficient p-values
# Your written answer here - summarize findings
Answer (click to reveal)
Our R-Squared is higher, but none of the coefficients are statistically significant at an alpha of 0.05 anymore. If what we care about is an inferential understanding of the data, this polynomial degree is too high.
Build a single plot that shows the raw data as a scatter plot, as well as all of the models you have developed as line graphs. Make sure that everything is labeled so you can tell the different models apart!
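A combined plot along these lines could be sketched as follows. This example uses synthetic data and `np.polynomial.Polynomial.fit` as a compact least-squares stand-in for the StatsModels fits, both of which are assumptions made to keep the sketch self-contained; in the lab you would plot the predictions from your own fitted models instead.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (an assumption; not the lab's sample_data.csv)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 80))
y = np.sin(2 * x) + 0.1 * x**2 + rng.normal(scale=0.3, size=x.size)

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(x, y, color="gray", alpha=0.6, label="data")

# Evaluate each fit on a dense, ordered grid so the curves are drawn smoothly
x_grid = np.linspace(x.min(), x.max(), 200)
for degree in (2, 4, 8):
    # Stand-in least-squares polynomial fit of the given degree
    fitted = np.polynomial.Polynomial.fit(x, y, deg=degree)
    ax.plot(x_grid, fitted(x_grid), label=f"degree {degree}")

ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Polynomial fits by degree")
ax.legend()
```

Plotting against a sorted grid rather than the raw x values avoids zig-zag lines when the original data is not ordered by x.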
# Your code here
Based on the metrics as well as the graphs, which model do you think is the best? Why?
# Your written answer here
Answer (click to reveal)
The quadratic model (polynomial degree 2) is definitely not the best based on all of the evidence we have. It has the worst R-Squared, the coefficient p-values are not significant, and you can see from the graph that there is a lot of variance in the data that it is not picking up on.
Our visual inspection aligns with the worse R-Squared for the 4th degree polynomial compared to the 8th degree polynomial. The 4th degree polynomial is flatter and doesn't seem to capture the extremes of the data as well.
However, if we wanted to interpret the coefficients, then only the 4th degree polynomial has statistically significant results. The interpretation would be challenging because of the number of terms, but we could apply some calculus techniques to describe inflection points.
Overall it appears that this dataset is not particularly well suited to an inferential linear regression approach, even with polynomial transformations. So the "best" model could be either the 4th or 8th degree polynomial depending on which aspect of the model is more important to you, but either way it will be challenging to translate it into insights for stakeholders.
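As an illustration of the calculus techniques mentioned in the answer above, the sketch below finds where the first and second derivatives of a 4th degree polynomial equal zero. The coefficients are hypothetical values chosen for this example, not the ones fitted in the lab.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical fitted coefficients in ascending order: const, x, x^2, x^3, x^4
coefs = [4.0, 0.0, -5.0, 0.0, 1.0]
curve = Polynomial(coefs)


def real_roots(poly):
    """Return the real roots of a polynomial, sorted ascending."""
    roots = poly.roots()
    return np.sort(roots[np.isclose(roots.imag, 0)].real)


# Zeros of the first derivative are candidate turning points (peaks and valleys)
turning_points = real_roots(curve.deriv(1))

# Zeros of the second derivative are candidate inflection points
# (where the curve switches between bending upward and downward)
inflection_points = real_roots(curve.deriv(2))

print("Turning point candidates:", turning_points)
print("Inflection point candidates:", inflection_points)
```

With real fitted coefficients, you would substitute the model's parameters for `coefs` and then describe the curve to stakeholders in terms of where it peaks, bottoms out, and changes curvature.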
Great job! You now know how to include polynomials in your linear models as well as the limitations of applying polynomial regression.