Hyperparameter Tuning and Pruning in Decision Trees - Lab

Introduction

In this lab, we will use the Titanic dataset to see the impact of tree pruning and hyperparameter tuning on the predictive performance of a decision tree classifier. Pruning reduces the size of a decision tree by removing nodes that provide little power to classify instances. Decision trees are among the machine learning algorithms most susceptible to overfitting, and effective pruning can reduce this likelihood. Below, we will see how we can tweak different hyperparameters to prune our trees optimally.

Objectives

You will be able to:

  • Demonstrate how pruning is performed for decision trees
  • Understand and explain the role of different decision tree hyperparameters
  • Select the best values for chosen hyperparameters and monitor the improvement in performance

Import necessary libraries

Let's first import the libraries you'll need for this lab.

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
plt.style.use('seaborn')

Read the Titanic Dataset

In the repo, the Titanic dataset has been cleaned up and pre-processed for you, so that you can focus on pruning and optimization. The feature set is available as features.csv and the target variable as target.csv.

  • Load these files into separate dataframes below.
  • Check the shapes for consistency and view the heads of both dataframes
# Load features and target variables

# Your code here 
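One way to fill in this cell (a sketch, assuming the files features.csv and target.csv described above sit in the working directory):

# Load the pre-processed features and target into separate dataframes
features = pd.read_csv('features.csv')
target = pd.read_csv('target.csv')

# Check shapes for consistency, then inspect the first few rows
print(features.shape, target.shape)
print(features.head())
print(target.head())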
(891, 13) (891, 1)
   PassengerId   Age  SibSp  Parch     Fare  Pclass_1  Pclass_2  Pclass_3  \
0            1  22.0      1      0   7.2500         0         0         1   
1            2  38.0      1      0  71.2833         1         0         0   
2            3  26.0      0      0   7.9250         0         0         1   
3            4  35.0      1      0  53.1000         1         0         0   
4            5  35.0      0      0   8.0500         0         0         1   

   Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S  
0           0         1           0           0           1  
1           1         0           1           0           0  
2           1         0           0           0           1  
3           1         0           0           0           1  
4           0         1           0           0           1  
   Survived
0         0
1         1
2         1
3         1
4         0

This is great. Now that we have our X (features) and y (target), we can go ahead and make the train/test split needed for supervised learning.

Create a 70/30 Train/Test split

  • Using the features and target variables above, create a 70/30 split with train_test_split()
# Create a 70/30 split for given X and y 


# Your code here 
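A minimal sketch for the split. The variable names X_train/X_test/y_train/y_test and the fixed random_state are conventions assumed here; target is reduced to its Survived column so y is one-dimensional.

# 70/30 train/test split; random_state fixed only for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    features, target['Survived'], test_size=0.3, random_state=42)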

Now that our data is ready for training, let's first train a decision tree classifier on it.

Train a Vanilla Classifier

Note: The term "vanilla" is used for a machine learning algorithm with its default settings (no tweaking/tuning).

  • Create a decision tree instance
  • Fit a decision tree classifier to the training dataset using the default settings for all hyperparameters, i.e., don't change any parameter
  • Set the impurity criterion to "entropy"
# Train the classifier using training data 


# Your code here 
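A sketch of the vanilla classifier, with entropy as the only non-default setting (variable names carried over from the split sketch above):

# Instantiate a decision tree with default hyperparameters,
# changing only the impurity criterion to entropy
dt = DecisionTreeClassifier(criterion='entropy')
dt.fit(X_train, y_train)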
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Make predictions

  • Create a set of predictions using the test set
  • Using y_test and y_pred, calculate the AUC (area under the ROC curve) to check the predictive performance
# Make predictions using test set 


# Your code here 


# Check the AUC of predictions


# Your code here 
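A sketch covering both cells above; it builds the ROC curve from the predicted class labels rather than probabilities, so the exact AUC you get will depend on your split.

# Predict labels for the test set
y_pred = dt.predict(X_test)

# Compute the ROC curve from the predicted labels and measure the AUC
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(fpr, tpr)
print(roc_auc)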
0.6990740740740741

Maximum Tree Depth

Let's first check for the best depth parameter for our decision tree.

  • Create an array of depth values ranging from 1 to 32.
  • In a loop, train the classifier for each depth value (32 runs)
  • Calculate the training and test AUC for each run
  • Plot a graph to show under/over fitting and optimal value
  • Interpret the results
# Identify the optimal tree depth for given data


# Your code here 
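One possible sketch, wrapping the train/evaluate/plot pattern in a small helper so the later hyperparameter sweeps can reuse it. sweep_and_plot is a hypothetical helper introduced here, not part of the starter code, and the fixed random_state is also an assumption.

def sweep_and_plot(param_name, values):
    # Train one tree per hyperparameter value and record train/test AUC
    train_aucs, test_aucs = [], []
    for v in values:
        dt = DecisionTreeClassifier(criterion='entropy', random_state=42,
                                    **{param_name: v})
        dt.fit(X_train, y_train)
        for X, y, scores in [(X_train, y_train, train_aucs),
                             (X_test, y_test, test_aucs)]:
            fpr, tpr, _ = roc_curve(y, dt.predict(X))
            scores.append(auc(fpr, tpr))
    # Plot both curves to expose under/overfitting regions
    plt.plot(values, train_aucs, 'b', label='Train AUC')
    plt.plot(values, test_aucs, 'r', label='Test AUC')
    plt.xlabel(param_name)
    plt.ylabel('AUC')
    plt.legend()
    plt.show()

# Depth values 1 through 32 (32 runs)
sweep_and_plot('max_depth', np.arange(1, 33))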

[Plot: training vs. test AUC as a function of maximum tree depth]

# Your observations here 

Minimum Sample Split

Now check for the best min_samples_split parameter for our decision tree.

  • Create an array of min_samples_split values ranging from 0.1 to 1.0 with an increment of 0.1
  • In a loop, train the classifier for each min_samples_split value (10 runs)
  • Calculate the training and test AUC for each run
  • Plot a graph to show under/over fitting and optimal value
  • Interpret the results
# Identify the optimal min-samples-split for given data


# Your code here 
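Reusing the sweep_and_plot helper sketched in the tree-depth section (an assumed helper, not part of the starter code):

# min_samples_split as a fraction of the training set: 0.1, 0.2, ..., 1.0
sweep_and_plot('min_samples_split', np.linspace(0.1, 1.0, 10))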

[Plot: training vs. test AUC as a function of min_samples_split]

# Your observations

Minimum Sample Leaf

Now check for the best min_samples_leaf parameter value for our decision tree.

  • Create an array of min_samples_leaf values ranging from 0.1 to 0.5 with an increment of 0.1
  • In a loop, train the classifier for each min_samples_leaf value (5 runs)
  • Calculate the training and test AUC for each run
  • Plot a graph to show under/over fitting and optimal value
  • Interpret the results
# Calculate the optimal value for min_samples_leaf

# Your code here 
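Again reusing the sweep_and_plot helper sketched earlier:

# min_samples_leaf as a fraction of the training set: 0.1 through 0.5
sweep_and_plot('min_samples_leaf', np.linspace(0.1, 0.5, 5))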

[Plot: training vs. test AUC as a function of min_samples_leaf]

# Your observations here 

Maximum Features

Now check for the best max_features parameter value for our decision tree.

  • Create an array of max_features values ranging from 1 to 12 (a single feature up to all of them)
  • In a loop, train the classifier for each max_features value (12 runs)
  • Calculate the training and test AUC for each run
  • Plot a graph to show under/over fitting and optimal value
  • Interpret the results
# Find the best value for optimal maximum feature size


# Your code here 
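The same sweep_and_plot helper sketched above covers this case as well:

# max_features from a single feature up to 12 features
sweep_and_plot('max_features', np.arange(1, 13))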

[Plot: training vs. test AUC as a function of max_features]

# Your observations here 

Re-train the classifier with chosen values

Now we'll take the best values from each training phase above, feed them back into our classifier, and see if we get any improvement in predictive performance.

  • Train the classifier with optimal values identified
  • Compare the AUC of the tuned classifier with the vanilla decision tree's AUC
  • Interpret the results of comparison
# train a classifier with optimal values identified above

# Your code here 
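A sketch of the retraining step. The hyperparameter values below are placeholders to be replaced with the optima read off your own plots, so the resulting AUC need not match the value shown underneath.

# Retrain with the tuned hyperparameters (placeholder values shown here)
dt_tuned = DecisionTreeClassifier(criterion='entropy',
                                  max_depth=3,            # replace with your optimum
                                  min_samples_split=0.7,  # replace with your optimum
                                  min_samples_leaf=0.25,  # replace with your optimum
                                  max_features=6,         # replace with your optimum
                                  random_state=42)
dt_tuned.fit(X_train, y_train)

# Compare test AUC with the vanilla tree's AUC
fpr, tpr, _ = roc_curve(y_test, dt_tuned.predict(X_test))
print(auc(fpr, tpr))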
0.7348379629629629
# Your observations here 

In the next section, we'll look at hyperparameter tuning with a technique called "grid search" to make this process even more granular and systematic.

Summary

In this lesson, we looked at tuning a decision tree classifier in order to avoid overfitting and increase its generalization capability. For the Titanic dataset, we saw that identifying optimal hyperparameter values can yield some improvement in predictive performance. This idea will be explored further in upcoming lessons and labs.