In this lab, you'll practice your knowledge of Ridge and Lasso regression!
You will be able to:
- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression
Let's look at yet another house pricing data set.
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('Housing_Prices/train.csv')
Look at the output of df.info().
# Your code here
We'll make a first selection of the data by removing the columns with dtype = object, so that our first model only contains continuous features.
Make sure to remove the SalePrice column from the predictors (which you store in X), then replace missing values with the median of each feature. Store the target in y.
# Load necessary packages
# remove "object"-type features and SalePrice from `X`
# Impute null values
# Create y
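As a rough illustration of the steps described above, here is a minimal sketch on a tiny made-up frame. The frame `toy` and its values are assumptions for demonstration only, not the lab data. It keeps the numeric columns (which, for this toy frame, is the same as dropping the object columns), removes the target from the predictors, and fills missing values with each column's median.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; the values are invented for illustration.
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
    "YearBuilt": [2003, 1976, 2001, 1915],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Keep only numeric columns, then drop the target from the predictors.
X_toy = toy.select_dtypes(include="number").drop(columns=["SalePrice"])

# Replace each missing value with the median of its column.
X_toy = X_toy.fillna(X_toy.median())

# Store the target separately.
y_toy = toy["SalePrice"]

print(X_toy)
```

The same pattern should carry over to df, though the real data has many more columns to check.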
Look at the information of X again.
Fit a linear regression model, then compute the R-squared and the MSE for both the train and test sets.
from sklearn.metrics import mean_squared_error, mean_squared_log_error
# Split in train and test
# Fit the model and print R2 and MSE for train and test
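One way to structure this step is sketched below on synthetic data. The arrays `X_demo` and `y_demo`, the 25% test size, and the random seeds are all assumptions chosen for illustration; on the lab data you would pass your own X and y instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out a quarter of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Fit on the training rows only.
linreg = LinearRegression().fit(X_train, y_train)

# Report R-squared and MSE on both splits.
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = linreg.predict(X_part)
    print(name, "R2:", r2_score(y_part, pred), "MSE:", mean_squared_error(y_part, pred))
```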
We haven't normalized our data yet. Let's create a new model that uses preprocessing.scale to scale our predictors!
from sklearn import preprocessing
# Scale the data and perform train test split
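The scaling step might look roughly like the sketch below. The small frame `X_demo` is an assumption for illustration. Since preprocessing.scale returns a NumPy array, the sketch wraps the result back into a DataFrame so the column names survive.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Small illustrative predictor frame (values are made up).
X_demo = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0],
    "YearBuilt": [2003.0, 1976.0, 2001.0, 1915.0],
})

# Standardize each column to zero mean and unit variance.
X_scaled = pd.DataFrame(
    preprocessing.scale(X_demo),
    columns=X_demo.columns,
    index=X_demo.index,
)

print(X_scaled.mean())
print(X_scaled.std(ddof=0))
```

After scaling, the train test split can be performed on the scaled frame exactly as before.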
Perform the same linear regression on this data and print out R-squared and MSE.
# Your code here
We haven't included dummy variables so far. Let's use our "object" variables again and create dummies.
# Create X_cat which contains only the categorical variables
# Make dummies
Merge X_cat together with our scaled X so you have one big predictor dataframe.
# Your code here
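Creating the dummies and merging them with the scaled predictors could be sketched as follows. The frames `X_scaled_demo` and `X_cat_demo` are hypothetical stand-ins, and the sketch assumes both frames share the same row index, which pd.concat relies on to align rows.

```python
import pandas as pd

# Hypothetical scaled numeric predictors and categorical predictors.
X_scaled_demo = pd.DataFrame({"LotArea": [-1.0, 0.0, 1.0]})
X_cat_demo = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})

# One indicator column per category level.
dummies = pd.get_dummies(X_cat_demo, dtype=int)

# Place numeric and dummy columns side by side (rows aligned on the index).
X_all_demo = pd.concat([X_scaled_demo, dummies], axis=1)

print(X_all_demo)
```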
Perform the same linear regression on this data and print out R-squared and MSE.
# Your code here
Notice the severe overfitting above: our training R-squared is quite high, but the testing R-squared is negative! Our predictions are far off. Similarly, the testing MSE is orders of magnitude higher than the training MSE.
Use all the data (normalized features and dummy categorical variables) to perform both Lasso and Ridge regression! Each time, look at the R-squared and MSE.
With default parameter (alpha = 1)
# Your code here
With a higher regularization parameter (alpha = 10)
# Your code here
With default parameter (alpha = 1)
# Your code here
With a higher regularization parameter (alpha = 10)
# Your code here
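The four fits above share the same shape, so one possible way to organize them is a loop over model class and alpha. The sketch below uses synthetic data (`X_demo`, `y_demo`, and the coefficient values are assumptions), in which only 3 of 20 features actually influence the target.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: 20 features, only the first 3 matter (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))
true_coef = np.zeros(20)
true_coef[:3] = [50.0, -30.0, 20.0]
y_demo = X_demo @ true_coef + rng.normal(scale=5.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Fit each model at each regularization strength and collect the metrics.
results = {}
for Model in (Lasso, Ridge):
    for alpha in (1, 10):
        model = Model(alpha=alpha).fit(X_train, y_train)
        results[(Model.__name__, alpha)] = {
            "train_r2": model.score(X_train, y_train),
            "test_r2": model.score(X_test, y_test),
            "train_mse": mean_squared_error(y_train, model.predict(X_train)),
            "test_mse": mean_squared_error(y_test, model.predict(X_test)),
        }

for key, metrics in results.items():
    print(key, metrics)
```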
Conclusions here
# number of Ridge params almost zero
# number of Lasso params almost zero
Compare with the total length of the parameter space and draw conclusions!
# your code here
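Counting near-zero coefficients can be sketched like this, again on synthetic data. The threshold `tol` is an arbitrary assumption, and `X_demo` and `y_demo` are invented. The sketch compares how many coefficients each model shrinks to (almost) zero against the total number of coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only the first 3 matter (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))
true_coef = np.zeros(20)
true_coef[:3] = [50.0, -30.0, 20.0]
y_demo = X_demo @ true_coef + rng.normal(scale=5.0, size=200)

ridge = Ridge(alpha=10).fit(X_demo, y_demo)
lasso = Lasso(alpha=10).fit(X_demo, y_demo)

# Treat coefficients below this magnitude as zero (assumed threshold).
tol = 1e-10
n_ridge_zero = int(np.sum(np.abs(ridge.coef_) < tol))
n_lasso_zero = int(np.sum(np.abs(lasso.coef_) < tol))

print("Ridge near-zero coefficients:", n_ridge_zero)
print("Lasso near-zero coefficients:", n_lasso_zero)
print("Total coefficients:", lasso.coef_.size)
```

Lasso's L1 penalty can set coefficients exactly to zero, while Ridge's L2 penalty only shrinks them toward zero, which is why the two counts typically differ.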
Great! You now know how to perform Lasso and Ridge regression.