Regression Trees and Model Optimization - Lab


In this lab, we'll see how to apply regression analysis using CART trees while making use of some hyperparameter tuning to improve our model.


In this lab you will:

  • Perform the full process of cleaning data, tuning hyperparameters, creating visualizations, and evaluating decision tree models
  • Determine the optimal hyperparameters for a decision tree model and evaluate the performance of decision tree models

Ames Housing dataset

The dataset is available in the file 'ames.csv'.

  • Import the dataset and examine its dimensions:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

# Load the Ames housing dataset 
data = None

# Print the dimensions of data

# Check out the info for the dataframe

# Show the first 5 rows
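One way to fill in the cell above is sketched below. It assumes 'ames.csv' sits in the working directory and reads cleanly with the default pandas settings; the helper name load_and_inspect is hypothetical and not part of the lab.

```python
import pandas as pd


def load_and_inspect(path='ames.csv'):
    """Load a CSV file and print its dimensions, info, and first 5 rows."""
    data = pd.read_csv(path)

    # Dimensions as (rows, columns)
    print(data.shape)

    # Column names, dtypes, and non-null counts
    data.info()

    # First 5 rows
    print(data.head())

    return data
```

Calling data = load_and_inspect() in the notebook would then give the DataFrame used in the rest of the lab.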

Identify features and target data

In this lab, we will use 3 continuous predictive features:


  • LotArea: Lot size in square feet
  • 1stFlrSF: Size of first floor in square feet
  • GrLivArea: Above grade (ground) living area square feet


The target variable is:

  • SalePrice: Sale price of the home, in dollars

  • Create DataFrames for the features and the target variable as shown above

  • Inspect the contents of both the features and the target variable

# Features and target data
target = None
features = None
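A minimal sketch of the selection step, assuming the loaded DataFrame contains the four column names listed above. The helper split_features_target and the two constants are hypothetical names introduced for illustration.

```python
import pandas as pd

# Column names taken from the feature and target lists in this lab
FEATURE_COLUMNS = ['LotArea', '1stFlrSF', 'GrLivArea']
TARGET_COLUMN = 'SalePrice'


def split_features_target(data):
    """Return (features, target), both as DataFrames."""
    features = data[FEATURE_COLUMNS]
    # Double brackets keep the target as a one-column DataFrame
    target = data[[TARGET_COLUMN]]
    return features, target
```

In the notebook this could be used as features, target = split_features_target(data), followed by features.describe() and target.describe() to inspect the contents.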

Inspect correlations

  • Use scatter plots to show the correlation between the chosen features and the target variable
  • Comment on each scatter plot
# Your code here 
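The scatter plots could be drawn roughly as follows. This is a sketch that assumes features is a DataFrame and target holds one value per row; the helper name plot_feature_scatter and the figure sizing are illustrative choices, not requirements of the lab.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_feature_scatter(features, target, target_name='SalePrice'):
    """Draw one scatter plot per feature column against the target."""
    n_features = features.shape[1]
    # squeeze=False keeps axes two-dimensional even for a single feature
    fig, axes = plt.subplots(1, n_features, figsize=(5 * n_features, 4),
                             sharey=True, squeeze=False)
    y = np.asarray(target).ravel()
    for ax, column in zip(axes[0], features.columns):
        ax.scatter(features[column], y, alpha=0.5)
        ax.set_xlabel(column)
        ax.set_title(f'{column} vs {target_name}')
    axes[0][0].set_ylabel(target_name)
    fig.tight_layout()
    return fig
```

Calling plot_feature_scatter(features, target) in the notebook displays the three panels side by side with a shared y-axis, which makes the relationships easier to compare.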

Create evaluation metrics

  • Import r2_score and mean_squared_error from sklearn.metrics
  • Create a function performance(true, predicted) to calculate and return both the R-squared score and Root Mean Squared Error (RMSE) for two equal-sized arrays for the given true and predicted values
    • Depending on your version of sklearn, to get the RMSE you will need to either set squared=False or take the square root of the output of the mean_squared_error function. Check out the documentation or this helpful and related StackOverflow post
    • The benefit of calculating RMSE instead of the Mean Squared Error (MSE) is that RMSE is in the same units as the target. Here, this means that RMSE is in dollars and measures how far off, on average, our predictions are from the actual home prices
# Import metrics

# Define the function
def performance(y_true, y_predict):
    """
    Calculates and returns the two performance scores between 
    true and predicted values - first R-Squared, then RMSE
    """

    # Calculate the r2 score between 'y_true' and 'y_predict'

    # Calculate the root mean squared error between 'y_true' and 'y_predict'

    # Return the score


# Test the function
score = performance([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])

# [0.9228556485355649, 0.6870225614927066]
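One possible completion of performance() is sketched here. It takes the square root of mean_squared_error rather than passing squared=False, on the assumption that this works regardless of the installed sklearn version.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error


def performance(y_true, y_predict):
    """Return [R-squared, RMSE] for the given true and predicted values."""
    # Proportion of variance in y_true explained by y_predict
    r2 = r2_score(y_true, y_predict)

    # Root of the mean squared error, in the same units as the target
    rmse = np.sqrt(mean_squared_error(y_true, y_predict))

    return [r2, rmse]
```

For the test arrays above, the squared residuals sum to 2.36 over 5 points, so the MSE is 0.472 and its square root matches the expected RMSE of about 0.687.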

Split the data into training and test sets

  • Split features and target datasets into training/test data (80/20)
  • For reproducibility, use random_state=42
from sklearn.model_selection import train_test_split 

# Split the data into training and test subsets
x_train, x_test, y_train, y_test = None
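The split itself can be sketched as below. The wrapper name split_data is hypothetical; the test_size and random_state values come from the instructions above.

```python
from sklearn.model_selection import train_test_split


def split_data(features, target):
    """Split features and target 80/20 with a fixed random seed."""
    # Returns x_train, x_test, y_train, y_test in that order
    return train_test_split(features, target, test_size=0.2, random_state=42)
```

In the notebook: x_train, x_test, y_train, y_test = split_data(features, target).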

Grow a vanilla regression tree

  • Import the DecisionTreeRegressor class
  • Run a baseline model for later comparison using the datasets created above
  • Generate predictions for the test dataset and calculate the performance measures using the function created above
  • Use random_state=45 for the tree instance
  • Record your observations
# Import DecisionTreeRegressor

# Instantiate DecisionTreeRegressor 
# Set random_state=45
regressor = None

# Fit the model to training data

# Make predictions on the test data
y_pred = None

# Calculate performance using the performance() function 
score = None

# [0.5961521990414137, 55656.48543887347] - R2, RMSE
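A sketch of the baseline run, assuming the training and test sets from the previous step. The helper name fit_baseline_tree is hypothetical, and the two metrics are computed inline so the block stands on its own.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.tree import DecisionTreeRegressor


def fit_baseline_tree(x_train, x_test, y_train, y_test):
    """Fit an untuned regression tree and return it with [R-squared, RMSE]."""
    # All hyperparameters at their defaults, so the tree grows to full depth
    regressor = DecisionTreeRegressor(random_state=45)
    regressor.fit(x_train, y_train)

    # Evaluate on the held-out test data
    y_pred = regressor.predict(x_test)
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    return regressor, [r2, rmse]
```

The returned regressor can also be inspected with regressor.get_depth() to see how deep the unconstrained tree grew, which is a useful reference point for the tuning steps that follow.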

Hyperparameter tuning (I)

  • Find the best tree depth using depth range: 1-30
  • Run the regressor repeatedly in a for loop for each depth value
  • Use random_state=45 for reproducibility
  • Calculate RMSE and r-squared for each run
  • Plot both performance measures for all runs
  • Comment on the output
# Your code here 
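The depth search described above might look like the following sketch. The function names sweep_max_depth and plot_sweep are hypothetical, and scoring each depth on the test set follows the lab's instructions rather than a cross-validation scheme.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.tree import DecisionTreeRegressor


def sweep_max_depth(x_train, x_test, y_train, y_test, depths=range(1, 31)):
    """Fit one tree per depth and collect test-set R-squared and RMSE."""
    r2_scores, rmse_scores = [], []
    for depth in depths:
        regressor = DecisionTreeRegressor(max_depth=depth, random_state=45)
        regressor.fit(x_train, y_train)
        y_pred = regressor.predict(x_test)
        r2_scores.append(r2_score(y_test, y_pred))
        rmse_scores.append(np.sqrt(mean_squared_error(y_test, y_pred)))
    return list(depths), r2_scores, rmse_scores


def plot_sweep(values, r2_scores, rmse_scores, xlabel):
    """Plot R-squared and RMSE side by side against a hyperparameter."""
    fig, (ax_r2, ax_rmse) = plt.subplots(1, 2, figsize=(12, 4))
    ax_r2.plot(values, r2_scores, marker='o')
    ax_r2.set_xlabel(xlabel)
    ax_r2.set_ylabel('R-squared')
    ax_rmse.plot(values, rmse_scores, marker='o')
    ax_rmse.set_xlabel(xlabel)
    ax_rmse.set_ylabel('RMSE')
    fig.tight_layout()
    return fig
```

A typical call would be depths, r2s, rmses = sweep_max_depth(x_train, x_test, y_train, y_test) followed by plot_sweep(depths, r2s, rmses, 'max_depth'). The depth with the lowest RMSE is depths[int(np.argmin(rmses))].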

Hyperparameter tuning (II)

  • Repeat the above process for min_samples_split
  • Use a range of values from 2-10 for this hyperparameter
  • Use random_state=45 for reproducibility
  • Visualize the output and comment on results as above
# Your code here 
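For min_samples_split, the same loop can collect its results in a DataFrame, which keeps plotting and lookup short. This is a sketch; the helper name sweep_min_samples_split and the column names 'r2' and 'rmse' are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.tree import DecisionTreeRegressor


def sweep_min_samples_split(x_train, x_test, y_train, y_test,
                            splits=range(2, 11)):
    """Return a DataFrame of test-set R-squared and RMSE per split value."""
    rows = []
    for n_split in splits:
        regressor = DecisionTreeRegressor(min_samples_split=n_split,
                                          random_state=45)
        regressor.fit(x_train, y_train)
        y_pred = regressor.predict(x_test)
        rows.append({
            'min_samples_split': n_split,
            'r2': r2_score(y_test, y_pred),
            'rmse': np.sqrt(mean_squared_error(y_test, y_pred)),
        })
    return pd.DataFrame(rows).set_index('min_samples_split')
```

With results = sweep_min_samples_split(x_train, x_test, y_train, y_test), calling results.plot(subplots=True, marker='o') draws both measures, and results['rmse'].idxmin() gives the split value with the lowest RMSE.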

Run the optimized model

  • Use the best values for max_depth and min_samples_split found in previous runs and run an optimized model with these values
  • Calculate the performance and comment on the output
# Your code here 
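A sketch of the optimized run follows. The best values depend on your own tuning output, so they are left as required arguments instead of being hard-coded; the helper name fit_tuned_tree is hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.tree import DecisionTreeRegressor


def fit_tuned_tree(x_train, x_test, y_train, y_test,
                   max_depth, min_samples_split):
    """Fit a tree with the chosen hyperparameters; return [R-squared, RMSE]."""
    regressor = DecisionTreeRegressor(max_depth=max_depth,
                                      min_samples_split=min_samples_split,
                                      random_state=45)
    regressor.fit(x_train, y_train)
    y_pred = regressor.predict(x_test)
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    return regressor, [r2, rmse]
```

Comparing the returned scores with the baseline scores shows whether limiting depth and split size reduced overfitting on this data.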

Level up (Optional)

  • How about bringing in some more features from the original dataset which may be good predictors?
  • Also, try tuning more hyperparameters like max_features to find a more optimal version of the model
# Your code here 
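As one option for tuning several hyperparameters at once, the sketch below uses sklearn's GridSearchCV. Note that this swaps the lab's test-set loop for cross-validation on the training data. The grid values and the helper name grid_search_tree are assumptions chosen for illustration, and max_features=2 assumes at least two feature columns.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor


def grid_search_tree(x_train, y_train, param_grid=None, cv=5):
    """Cross-validated grid search over regression tree hyperparameters."""
    if param_grid is None:
        # Illustrative grid; None for max_features means "use all features"
        param_grid = {
            'max_depth': [3, 5, 7, 10],
            'min_samples_split': [2, 5, 10],
            'max_features': [1, 2, None],
        }
    search = GridSearchCV(
        DecisionTreeRegressor(random_state=45),
        param_grid,
        # sklearn maximizes scores, so RMSE is reported as a negative number
        scoring='neg_root_mean_squared_error',
        cv=cv,
    )
    search.fit(x_train, y_train)
    return search
```

After search = grid_search_tree(x_train, y_train), the selected combination is in search.best_params_, the cross-validated RMSE is -search.best_score_, and search.best_estimator_ can be scored on the test set for comparison with the earlier models.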


In this lab, we looked at applying a decision-tree-based regression analysis on the Ames Housing dataset. We saw how to train various models to find the optimal values for hyperparameters.