In this lab, you will use the Titanic dataset to see the impact of tree pruning and hyperparameter tuning on the predictive performance of a decision tree classifier. Pruning reduces the size of a decision tree by removing nodes that provide little predictive power. Decision trees are among the machine learning algorithms most susceptible to overfitting, and effective pruning can reduce this likelihood.
In this lab you will:
- Determine the optimal hyperparameters for a decision tree model and evaluate the model performance
Let's first import the libraries you'll need for this lab.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
plt.style.use('seaborn')  # on Matplotlib >= 3.6, use 'seaborn-v0_8' instead
The Titanic dataset, available in `'titanic.csv'`, has been cleaned and preprocessed for you so that you can focus on pruning and optimization. Import the dataset and print the first five rows of the data:
# Import the data
df = None
- Assign the `'Survived'` column to `y`
- Drop the `'Survived'` and `'PassengerId'` columns from `df`, and assign the resulting DataFrame to `X`
- Split `X` and `y` into training and test sets. Assign 30% to the test set and set the `random_state` to `SEED`
# Create X and y
y = None
X = None
# Split into training and test sets
SEED = 1
X_train, X_test, y_train, y_test = None
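One possible sketch of this step. To keep the snippet self-contained and runnable, it uses a synthetic dataset from `make_classification` as a stand-in for the Titanic `X` and `y`; in the lab you would use the DataFrame loaded from `'titanic.csv'` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

SEED = 1

# Synthetic stand-in for the Titanic features and labels (12 columns, binary target)
X, y = make_classification(n_samples=200, n_features=12, random_state=SEED)

# 70/30 train-test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

print(X_train.shape, X_test.shape)  # (140, 12) (60, 12)
```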
Note: The term "vanilla" is used for a machine learning algorithm with its default settings (no tweaking/tuning).
- Instantiate a decision tree classifier
- Use the `'entropy'` criterion and set the `random_state` to `SEED`
- Fit this classifier to the training data
# Train the classifier using training data
dt = None
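A sketch of the vanilla classifier, again on synthetic stand-in data so it runs standalone; only the split criterion and seed are set, everything else stays at its defaults.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 1
# Synthetic stand-in for the Titanic data
X, y = make_classification(n_samples=200, n_features=12, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

# "Vanilla" tree: only criterion and random_state are specified
dt = DecisionTreeClassifier(criterion='entropy', random_state=SEED)
dt.fit(X_train, y_train)
```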
- Create a set of predictions using the test set
- Using `y_test` and `y_pred`, calculate the AUC (area under the curve) to check the predictive performance
# Make predictions using test set
y_pred = None
# Check the AUC of predictions
false_positive_rate, true_positive_rate, thresholds = None
roc_auc = None
roc_auc
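A sketch of the prediction and AUC step, on the same synthetic stand-in data. Note that feeding hard class predictions into `roc_curve` yields a coarse curve; `predict_proba` would give a smoother one, but the hard-prediction version matches the `y_pred` workflow described above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc

SEED = 1
# Synthetic stand-in for the Titanic data
X, y = make_classification(n_samples=200, n_features=12, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

dt = DecisionTreeClassifier(criterion='entropy', random_state=SEED)
dt.fit(X_train, y_train)

# Predict on the held-out test set
y_pred = dt.predict(X_test)

# AUC from the ROC curve of the predictions
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
print(round(roc_auc, 3))
```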
Let's first check for the best `max_depth` value for our decision tree:
- Create an array of `max_depth` values ranging from 1 to 32
- In a loop, train the classifier for each depth value (32 runs)
- Calculate the training and test AUC for each run
- Plot a graph to show under/overfitting and the optimal value
- Interpret the results
# Identify the optimal tree depth for given data
# Your observations here
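The loop above can be sketched as follows, using the same synthetic stand-in data so the snippet runs on its own. Training AUC typically climbs toward 1.0 as depth grows while test AUC plateaus or drops, which is the overfitting signal the plot should reveal.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc

SEED = 1
# Synthetic stand-in for the Titanic data
X, y = make_classification(n_samples=200, n_features=12, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

max_depths = np.arange(1, 33)  # depths 1 through 32
train_aucs, test_aucs = [], []
for depth in max_depths:
    dt = DecisionTreeClassifier(criterion='entropy', max_depth=depth,
                                random_state=SEED)
    dt.fit(X_train, y_train)
    # AUC on both splits to expose the train/test gap
    for X_part, y_part, scores in ((X_train, y_train, train_aucs),
                                   (X_test, y_test, test_aucs)):
        fpr, tpr, _ = roc_curve(y_part, dt.predict(X_part))
        scores.append(auc(fpr, tpr))

plt.plot(max_depths, train_aucs, label='Train AUC')
plt.plot(max_depths, test_aucs, label='Test AUC')
plt.xlabel('max_depth')
plt.ylabel('AUC')
plt.legend()
plt.show()
```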
Now check for the best `min_samples_split` value for our decision tree:
- Create an array of `min_samples_split` values ranging from 0.1 to 1.0 with an increment of 0.1
- In a loop, train the classifier for each `min_samples_split` value (10 runs)
- Calculate the training and test AUC for each run
- Plot a graph to show under/overfitting and the optimal value
- Interpret the results
# Identify the optimal min-samples-split for given data
# Your observations here
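A condensed sketch of this search (synthetic stand-in data again, and only the test AUC shown for brevity; the full lab version would track the training AUC and plot both, as in the `max_depth` step). When `min_samples_split` is a float, sklearn interprets it as a fraction of the training samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc

SEED = 1
# Synthetic stand-in for the Titanic data
X, y = make_classification(n_samples=200, n_features=12, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

splits = np.linspace(0.1, 1.0, 10)  # fractions of the training set
test_aucs = []
for s in splits:
    dt = DecisionTreeClassifier(criterion='entropy',
                                min_samples_split=float(s),
                                random_state=SEED)
    dt.fit(X_train, y_train)
    fpr, tpr, _ = roc_curve(y_test, dt.predict(X_test))
    test_aucs.append(auc(fpr, tpr))

best_split = splits[int(np.argmax(test_aucs))]
print(best_split)
```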
Now check for the best `min_samples_leaf` value for our decision tree:
- Create an array of `min_samples_leaf` values ranging from 0.1 to 0.5 with an increment of 0.1
- In a loop, train the classifier for each `min_samples_leaf` value (5 runs)
- Calculate the training and test AUC for each run
- Plot a graph to show under/overfitting and the optimal value
- Interpret the results
# Calculate the optimal value for minimum sample leafs
# Your observations here
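The same loop pattern applies here (sketch on synthetic stand-in data, test AUC only for brevity). Float values of `min_samples_leaf` are fractions of the training set, and sklearn caps them at 0.5, which is why the search range stops there.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc

SEED = 1
# Synthetic stand-in for the Titanic data
X, y = make_classification(n_samples=200, n_features=12, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

leafs = np.linspace(0.1, 0.5, 5)  # fractions of the training set
test_aucs = []
for leaf in leafs:
    dt = DecisionTreeClassifier(criterion='entropy',
                                min_samples_leaf=float(leaf),
                                random_state=SEED)
    dt.fit(X_train, y_train)
    fpr, tpr, _ = roc_curve(y_test, dt.predict(X_test))
    test_aucs.append(auc(fpr, tpr))

best_leaf = leafs[int(np.argmax(test_aucs))]
print(best_leaf)
```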
Now check for the best `max_features` value for our decision tree:
- Create an array of `max_features` values ranging from 1 to 12 (1 feature vs. all)
- In a loop, train the classifier for each `max_features` value (12 runs)
- Calculate the training and test AUC for each run
- Plot a graph to show under/overfitting and the optimal value
- Interpret the results
# Find the best value for optimal maximum feature size
# Your observations here
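A sketch of the `max_features` search on the synthetic stand-in data (which has 12 columns, matching the 1-to-12 range; test AUC only for brevity). Here `max_features` is the number of features considered when looking for each split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc

SEED = 1
# Synthetic stand-in for the Titanic data (12 features)
X, y = make_classification(n_samples=200, n_features=12, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

features = np.arange(1, 13)  # 1 feature up to all 12
test_aucs = []
for f in features:
    dt = DecisionTreeClassifier(criterion='entropy', max_features=int(f),
                                random_state=SEED)
    dt.fit(X_train, y_train)
    fpr, tpr, _ = roc_curve(y_test, dt.predict(X_test))
    test_aucs.append(auc(fpr, tpr))

best_features = int(features[int(np.argmax(test_aucs))])
print(best_features)
```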
Now we will take the best value from each search above and feed them back into our classifier together, to see whether predictive performance improves.
- Train the classifier with the optimal values identified
- Compare the AUC of the new model with the earlier vanilla decision tree AUC
- Interpret the results of the comparison
# Train a classifier with optimal values identified above
dt = None
false_positive_rate, true_positive_rate, thresholds = None
roc_auc = None
roc_auc
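A sketch of the final step on the synthetic stand-in data. The hyperparameter values below are illustrative placeholders only, not the actual optima for the Titanic data; substitute the values you identified in your own searches above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc

SEED = 1
# Synthetic stand-in for the Titanic data
X, y = make_classification(n_samples=200, n_features=12, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

# Placeholder "optimal" values -- replace with the ones found in your searches
dt = DecisionTreeClassifier(criterion='entropy',
                            max_depth=3,
                            min_samples_split=0.5,
                            min_samples_leaf=0.2,
                            max_features=8,
                            random_state=SEED)
dt.fit(X_train, y_train)

false_positive_rate, true_positive_rate, thresholds = roc_curve(
    y_test, dt.predict(X_test))
roc_auc = auc(false_positive_rate, true_positive_rate)
print(round(roc_auc, 3))  # compare against the vanilla tree's AUC
```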
# Your observations here
In the next section, we will look at hyperparameter tuning with a technique called grid search, which makes this process more systematic and precise.
In this lesson, we looked at tuning a decision tree classifier in order to avoid overfitting and increase its generalization capabilities. For the Titanic dataset, we saw that identifying optimal hyperparameter values can yield some improvement in predictive performance. This idea will be explored further in upcoming lessons and labs.