In this lab, you will use the Titanic dataset to see the impact of tree pruning and hyperparameter tuning on the predictive performance of a decision tree classifier. Pruning reduces the size of a decision tree by removing nodes that provide little predictive power. Decision trees are among the machine learning algorithms most susceptible to overfitting, and effective pruning can reduce this likelihood.
In this lab you will:
- Determine the optimal hyperparameters for a decision tree model and evaluate the model performance
Let's first import the libraries you'll need for this lab.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
plt.style.use('seaborn')  # on Matplotlib >= 3.6, use 'seaborn-v0_8' instead
The Titanic dataset, available in `'titanic.csv'`, has been cleaned and preprocessed for you so that you can focus on pruning and optimization. Import the dataset and print the first five rows of the data:
# Import the data
df = None
- Assign the `'Survived'` column to `y`
- Drop the `'Survived'` and `'PassengerId'` columns from `df`, and assign the resulting DataFrame to `X`
- Split `X` and `y` into training and test sets. Assign 30% to the test set and set the `random_state` to `SEED`
# Create X and y
y = None
X = None
# Split into training and test sets
SEED = 1
X_train, X_test, y_train, y_test = None
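One possible sketch of this step. To keep the snippet self-contained and runnable, it uses a synthetic dataset from `make_classification` as a stand-in for the Titanic `X` and `y`; in the lab you would use the DataFrame loaded from `'titanic.csv'` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

SEED = 1

# Synthetic stand-in for the Titanic features and labels (12 columns, binary target)
X, y = make_classification(n_samples=200, n_features=12, random_state=SEED)

# 70/30 train-test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

print(X_train.shape, X_test.shape)  # (140, 12) (60, 12)
```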
Note: The term "vanilla" is used for a machine learning algorithm with its default settings (no tweaking/tuning).
- Instantiate a decision tree classifier
- Use the `'entropy'` criterion and set the `random_state` to `SEED`
- Fit this classifier to the training data
# Train the classifier using training data
dt = None
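A sketch of the vanilla classifier, again on synthetic stand-in data so it runs standalone; only the split criterion and seed are set, everything else stays at its defaults.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 1
# Synthetic stand-in for the Titanic data
X, y = make_classification(n_samples=200, n_features=12, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

# "Vanilla" tree: only criterion and random_state are specified
dt = DecisionTreeClassifier(criterion='entropy', random_state=SEED)
dt.fit(X_train, y_train)
```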
- Create a set of predictions using the test set
- Using `y_test` and `y_pred`, calculate the AUC (area under the curve) to check the predictive performance
# Make predictions using test set
y_pred = None
# Check the AUC of predictions
false_positive_rate, true_positive_rate, thresholds = None
roc_auc = None
roc_auc
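A sketch of the prediction and AUC step, on the same synthetic stand-in data. Note that feeding hard class predictions into `roc_curve` yields a coarse curve; `predict_proba` would give a smoother one, but the hard-prediction version matches the `y_pred` workflow described above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc

SEED = 1
# Synthetic stand-in for the Titanic data
X, y = make_classification(n_samples=200, n_features=12, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

dt = DecisionTreeClassifier(criterion='entropy', random_state=SEED)
dt.fit(X_train, y_train)

# Predict on the held-out test set
y_pred = dt.predict(X_test)

# AUC from the ROC curve of the predictions
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
roc_auc = auc(false_positive_rate, true_positive_rate)
print(round(roc_auc, 3))
```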
Let's first check for the best `max_depth` value for our decision tree:
- Create an array of `max_depth` values ranging from 1 to 32
- In a loop, train the classifier for each depth value (32 runs)
- Calculate the training and test AUC for each run
- Plot a graph to show under/overfitting and the optimal value
- Interpret the results
# Identify the optimal tree depth for given data
# Your observations here
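The loop above can be sketched as follows, using the same synthetic stand-in data so the snippet runs on its own. Training AUC typically climbs toward 1.0 as depth grows while test AUC plateaus or drops, which is the overfitting signal the plot should reveal.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc

SEED = 1
# Synthetic stand-in for the Titanic data
X, y = make_classification(n_samples=200, n_features=12, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

max_depths = np.arange(1, 33)  # depths 1 through 32
train_aucs, test_aucs = [], []
for depth in max_depths:
    dt = DecisionTreeClassifier(criterion='entropy', max_depth=depth,
                                random_state=SEED)
    dt.fit(X_train, y_train)
    # AUC on both splits to expose the train/test gap
    for X_part, y_part, scores in ((X_train, y_train, train_aucs),
                                   (X_test, y_test, test_aucs)):
        fpr, tpr, _ = roc_curve(y_part, dt.predict(X_part))
        scores.append(auc(fpr, tpr))

plt.plot(max_depths, train_aucs, label='Train AUC')
plt.plot(max_depths, test_aucs, label='Test AUC')
plt.xlabel('max_depth')
plt.ylabel('AUC')
plt.legend()
plt.show()
```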
Now check for the best `min_samples_split` value for our decision tree:
- Create an array of `min_samples_split` values ranging from 0.1 to 1.0 with an increment of 0.1
- In a loop, train the classifier for each `min_samples_split` value (10 runs)
- Calculate the training and test AUC for each run
- Plot a graph to show under/overfitting and the optimal value
- Interpret the results
# Identify the optimal min-samples-split for given data
# Your observations here
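A condensed sketch of this search (synthetic stand-in data again, and only the test AUC shown for brevity; the full lab version would track the training AUC and plot both, as in the `max_depth` step). When `min_samples_split` is a float, sklearn interprets it as a fraction of the training samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc

SEED = 1
# Synthetic stand-in for the Titanic data
X, y = make_classification(n_samples=200, n_features=12, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

splits = np.linspace(0.1, 1.0, 10)  # fractions of the training set
test_aucs = []
for s in splits:
    dt = DecisionTreeClassifier(criterion='entropy',
                                min_samples_split=float(s),
                                random_state=SEED)
    dt.fit(X_train, y_train)
    fpr, tpr, _ = roc_curve(y_test, dt.predict(X_test))
    test_aucs.append(auc(fpr, tpr))

best_split = splits[int(np.argmax(test_aucs))]
print(best_split)
```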
Now check for the best `min_samples_leaf` value for our decision tree:
- Create an array of `min_samples_leaf` values ranging from 0.1 to 0.5 with an increment of 0.1
- In a loop, train the classifier for each `min_samples_leaf` value (5 runs)
- Calculate the training and test AUC for each run
- Plot a graph to show under/overfitting and the optimal value
- Interpret the results
# Calculate the optimal value for minimum sample leafs
# Your observations here
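The same loop pattern applies here (sketch on synthetic stand-in data, test AUC only for brevity). Float values of `min_samples_leaf` are fractions of the training set, and sklearn caps them at 0.5, which is why the search range stops there.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc

SEED = 1
# Synthetic stand-in for the Titanic data
X, y = make_classification(n_samples=200, n_features=12, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

leafs = np.linspace(0.1, 0.5, 5)  # fractions of the training set
test_aucs = []
for leaf in leafs:
    dt = DecisionTreeClassifier(criterion='entropy',
                                min_samples_leaf=float(leaf),
                                random_state=SEED)
    dt.fit(X_train, y_train)
    fpr, tpr, _ = roc_curve(y_test, dt.predict(X_test))
    test_aucs.append(auc(fpr, tpr))

best_leaf = leafs[int(np.argmax(test_aucs))]
print(best_leaf)
```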
Now check for the best `max_features` value for our decision tree:
- Create an array of `max_features` values ranging from 1 to 12 (1 feature vs. all)
- In a loop, train the classifier for each `max_features` value (12 runs)
- Calculate the training and test AUC for each run
- Plot a graph to show under/overfitting and the optimal value
- Interpret the results
# Find the best value for optimal maximum feature size
# Your observations here
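A sketch of the `max_features` search on the synthetic stand-in data (which has 12 columns, matching the 1-to-12 range; test AUC only for brevity). Here `max_features` is the number of features considered when looking for each split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc

SEED = 1
# Synthetic stand-in for the Titanic data (12 features)
X, y = make_classification(n_samples=200, n_features=12, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

features = np.arange(1, 13)  # 1 feature up to all 12
test_aucs = []
for f in features:
    dt = DecisionTreeClassifier(criterion='entropy', max_features=int(f),
                                random_state=SEED)
    dt.fit(X_train, y_train)
    fpr, tpr, _ = roc_curve(y_test, dt.predict(X_test))
    test_aucs.append(auc(fpr, tpr))

best_features = int(features[int(np.argmax(test_aucs))])
print(best_features)
```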
Now we will take the best value from each search above and feed them back into our classifier together, to see whether predictive performance improves.
- Train the classifier with the optimal values identified
- Compare the AUC of the new model with the earlier vanilla decision tree AUC
- Interpret the results of the comparison
# Train a classifier with optimal values identified above
dt = None
false_positive_rate, true_positive_rate, thresholds = None
roc_auc = None
roc_auc
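A sketch of the final step on the synthetic stand-in data. The hyperparameter values below are illustrative placeholders only, not the actual optima for the Titanic data; substitute the values you identified in your own searches above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc

SEED = 1
# Synthetic stand-in for the Titanic data
X, y = make_classification(n_samples=200, n_features=12, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

# Placeholder "optimal" values -- replace with the ones found in your searches
dt = DecisionTreeClassifier(criterion='entropy',
                            max_depth=3,
                            min_samples_split=0.5,
                            min_samples_leaf=0.2,
                            max_features=8,
                            random_state=SEED)
dt.fit(X_train, y_train)

false_positive_rate, true_positive_rate, thresholds = roc_curve(
    y_test, dt.predict(X_test))
roc_auc = auc(false_positive_rate, true_positive_rate)
print(round(roc_auc, 3))  # compare against the vanilla tree's AUC
```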
# Your observations here
In the next section, we will look at hyperparameter tuning with a technique called grid search, which makes this process more systematic and precise.
In this lesson, we looked at tuning a decision tree classifier in order to avoid overfitting and increase its generalization capabilities. For the Titanic dataset, we saw that identifying optimal hyperparameter values can yield some improvement in predictive performance. This idea will be explored further in upcoming lessons and labs.