In this lab, we'll explore how to use scikit-learn's GridSearchCV class to exhaustively search through every combination of hyperparameters until we find the best values for a given model.
You will be able to:
- Understand and explain parameter tuning and why it is necessary
- Design and create a parameter grid for use with sklearn's GridSearchCV module
- Use GridSearchCV to increase model performance through parameter tuning
For this lab, we'll be working with the Wine Quality Dataset from the UCI Machine Learning Repository. We'll be using data about the various features of wine to predict the quality of the wine on a scale from 1-10 stars, making this a multiclass classification problem.
Before we can begin GridSearching our way to optimal hyperparameters, we'll need to go through the basic steps of modeling. This means that we'll need to:
- Import and inspect the dataset (and clean, if necessary)
- Split the data into training and testing sets
- Build and fit a baseline model that we can compare against our GridSearch results.
Run the cell below to import everything we'll need for this lab.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# cross_val_score lives in sklearn.model_selection; the old sklearn.cross_validation module was removed in version 0.20
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score
Now that we've imported all the necessary libraries and frameworks for this lab, we'll need to get the dataset.
Our data is stored in the file winequality-red.csv. Use pandas to import the data from this file and store it in a DataFrame. Print the head to ensure that everything loaded correctly.
df = None
Great! Let's inspect our data a bit. In the cell below, perform some basic Exploratory Data Analysis on our dataset. Get a feel for your data by exploring the descriptive statistics and creating at least 1 visualization to help you better understand this dataset.
Question: Based on your findings during your Exploratory Data Analysis, do you think that we need to do any sort of preprocessing on this dataset? Why or why not?
Write your answer below this line:
Now, we'll perform any necessary preprocessing on our dataset before training our model. We'll start by isolating the target variable that we are trying to predict. In the cell below:
- Store the data in the quality column inside the labels variable
- Drop the quality column from the dataset
labels = None
labels_removed_df = None
Now that we've isolated our labels, we'll need to normalize our dataset (also referred to as scaling).
In the cell below:
- Create a StandardScaler() object.
- Transform the data in labels_removed_df using the scaler object's fit_transform() method.
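As a rough illustration of these two preprocessing steps, here is a minimal sketch on a tiny made-up DataFrame. The values and the names toy_df, toy_labels, toy_features, and toy_scaler are assumptions for illustration only; they are not the lab's data or variables.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names only mimic the wine dataset.
toy_df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 6, 7],
})

# Isolate the target, then drop it from the feature set.
toy_labels = toy_df["quality"]
toy_features = toy_df.drop("quality", axis=1)

# fit_transform() learns each column's mean and standard deviation,
# then returns a NumPy array of standardized values.
toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy_features)

print(toy_scaled.mean(axis=0))  # approximately 0 for every column
print(toy_scaled.std(axis=0))   # approximately 1 for every column
```

Note that fit_transform() returns a NumPy array rather than a DataFrame, so the column names are not carried over.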
scaler = None
scaled_df = None
Normally, we would also split our data into training and testing sets. However, since we'll be making use of Cross Validation when using GridSearchCV, we'll also want to make use of it with our baseline model to ensure that things are equal. Recall that we do not need to split our data into training and testing sets when using cross validation, since the cross validation will take care of that for us.
In the cell below:
- Create a DecisionTreeClassifier object.
- Get the cross_val_score for this model, with the cv parameter set to 3.
- Calculate and print the mean cross-validation score from our model.
Note: If you need a refresher on how to use cross_val_score, check out the documentation.
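For reference, a minimal sketch of cross_val_score on synthetic data might look like the following. The data from make_classification and the names X_demo, y_demo, demo_tree, and demo_scores are assumptions chosen for illustration; the lab itself uses the scaled wine data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration, not the wine data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

demo_tree = DecisionTreeClassifier(random_state=0)

# cv=3 fits the model three times, each time holding out a different third
# of the data, and returns one accuracy score per fold.
demo_scores = cross_val_score(demo_tree, X_demo, y_demo, cv=3)

print(demo_scores)
print("Mean Cross Validation Score: {:.4}%".format(demo_scores.mean() * 100))
```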
dt_clf = None
dt_cv_score = None
mean_dt_cv_score = None
# print("Mean Cross Validation Score: {:.4}%".format(mean_dt_cv_score * 100))
Take a second to interpret the results of our cross-validation score. How well did our model do? How does this compare to a naive baseline level of accuracy (random guessing)?
Write your answer below this line:
So far, our model has not had stellar performance. However, we've yet to modify the hyperparameters of the model. Each dataset is different, and the chances that the best possible parameters for a given dataset also happen to be the default parameters set by sklearn at instantiation are very low.
This means that we need to try Hyperparameter Tuning. There are several strategies for searching for optimal hyperparameters--the one we'll be using, Combinatoric Grid Searching, is probably the most popular, because it performs an exhaustive search of all possible combinations.
The sklearn class we'll be using to accomplish this is GridSearchCV, which can be found inside of sklearn.model_selection.
Take a minute to look at sklearn's user guide for GridSearchCV, and then complete the following task.
In the cell below:
- Complete the param_grid dictionary. In this dictionary, each key represents a parameter we want to tune, whereas the corresponding value is an array of every parameter value we'd like to check for that parameter. For instance, if we would like to try out the values 2, 5, and 10 for min_samples_split, our param_grid dictionary would include "min_samples_split": [2, 5, 10].
- Normally, you would have to just try different values to search through for each parameter. However, in order to limit the complexity of this lab, the parameters and values to search through have been provided for you. You just need to turn them into key-value pairs inside of the param_grid dictionary. Complete param_grid so that it tests the following values for each corresponding parameter:
  - For "criterion", try values of "gini" and "entropy".
  - For "max_depth", try None, as well as 2, 3, 4, 5, and 6.
  - For "min_samples_split", try 2, 5, and 10.
  - For "min_samples_leaf", try 1, 2, 3, 4, 5, and 6.
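To make the structure of a parameter grid concrete, here is a small sketch that expands a grid into its individual combinations using sklearn's ParameterGrid helper. The grid shown is a hypothetical one, deliberately smaller than the grid this lab asks for, and the names demo_param_grid and demo_combinations are illustrative assumptions.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid, deliberately smaller than the one in this lab.
demo_param_grid = {
    "criterion": ["gini", "entropy"],
    "min_samples_split": [2, 5, 10],
}

# ParameterGrid expands the dictionary into every combination of values.
demo_combinations = list(ParameterGrid(demo_param_grid))

for combo in demo_combinations:
    print(combo)

print(len(demo_combinations))  # 2 * 3 = 6 combinations
```

Each printed dictionary is one candidate setting that a grid search would evaluate.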
dt_param_grid = {
}
Now that we have our parameter grid set up, we can create and use our GridSearchCV object. Before we do, let's briefly think about the particulars of this model.
Grid Searching works by training a model on the data for each unique combination of parameters, and then returning the parameters of the model that performed best. In order to protect us from randomness, it is common to implement K-Fold Cross Validation during this step. For this lab, we'll set K = 3, meaning that we'll actually train 3 different models for each unique combination of parameters.
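The arithmetic behind this can be sketched as follows: the number of combinations is the product of the lengths of the value lists, and each combination is fit once per fold. The grid and the fold count below are hypothetical values chosen for illustration, not the ones used in this lab.

```python
from math import prod

# Hypothetical grid for illustration only (not the lab's grid).
demo_param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 3, 5],
    "min_samples_split": [2, 10],
}
demo_k = 5  # number of cross-validation folds assumed for this example

# One combination per element of the Cartesian product of the value lists.
demo_n_combinations = prod(len(values) for values in demo_param_grid.values())

# Each combination is fit once per fold.
demo_n_fits = demo_n_combinations * demo_k

print(demo_n_combinations)  # 2 * 3 * 2 = 12
print(demo_n_fits)          # 12 * 5 = 60
```

With GridSearchCV's default refit=True setting, one additional model is fit on the full dataset using the best combination once the search is done.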
Given our param_grid and the knowledge that we're going to use Cross Validation with a value of 3, how many different Decision Trees will our GridSearchCV object have to train in order to try every possible combination and find the best parameter choices?
Calculate and print your answer in the cell below.
num_decision_trees = None
print("Grid Search will have to search through {} different permutations.".format(num_decision_trees))
Grid Search will have to search through None different permutations.
That's a lot of Decision Trees! Decision Trees are generally pretty quick to train, but that isn't the case with every type of model we could want to tune. Be aware that if you set a particularly large search space of parameters inside your parameter grid, then Grid Searching could potentially take a very long time.
Let's create our GridSearchCV object and fit it. In the cell below:
- Create a GridSearchCV object. Pass in our model, the parameter grid, and cv=3 to tell the object to use 3-Fold Cross Validation. Also pass in return_train_score=True, so that the training scores are recorded in the results.
- Call our grid search object's fit() method and pass in our data and labels, just as if we were using regular cross validation.
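The overall pattern can be sketched on synthetic data as shown below. The data, the small grid, and the names X_demo, y_demo, demo_param_grid, and demo_grid_search are all assumptions for illustration; they are not the lab's data or its parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration, not the wine data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Hypothetical grid, much smaller than the one in this lab.
demo_param_grid = {
    "max_depth": [2, 4],
    "min_samples_leaf": [1, 5],
}

demo_grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    demo_param_grid,
    cv=3,
    return_train_score=True,  # record training scores in cv_results_
)

# fit() runs 3-fold cross validation for every combination in the grid.
demo_grid_search.fit(X_demo, y_demo)

print(len(demo_grid_search.cv_results_["params"]))  # 2 * 2 = 4 combinations
```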
dt_grid_search = None
#dt_grid_search.fit(None, None)
Now that we have fit our model using Grid Search, we need to inspect it to discover the optimal combination of parameters.
In the cell below:
- Calculate the mean training score. An array of training score results can be found inside of the .cv_results_ dictionary, with the key mean_train_score.
- Calculate the testing score using our grid search model's .score() method by passing in our data and labels.
- Examine the appropriate attribute to discover the best estimator parameters found during the grid search.
HINT: If you're unsure what attribute this is stored in, take a look at sklearn's GridSearchCV Documentation.
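As a hedged sketch of what inspecting a fitted grid search can look like, the example below uses sklearn's bundled iris dataset and a tiny hypothetical grid. The names X_iris, y_iris, iris_grid_search, iris_mean_train_score, and iris_score are illustrative assumptions, not variables from this lab.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Bundled example data, used here only as a stand-in for the wine data.
X_iris, y_iris = load_iris(return_X_y=True)

iris_grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 2, 3]},  # hypothetical grid for illustration
    cv=3,
    return_train_score=True,
)
iris_grid_search.fit(X_iris, y_iris)

# cv_results_ holds one mean training score per parameter combination;
# averaging them gives a single summary number.
iris_mean_train_score = np.mean(iris_grid_search.cv_results_["mean_train_score"])

# .score() uses the best estimator, refit on all of the data passed to fit().
iris_score = iris_grid_search.score(X_iris, y_iris)

print("Mean Training Score: {:.4}%".format(iris_mean_train_score * 100))
print("Score on the fitting data: {:.4}%".format(iris_score * 100))
print(iris_grid_search.best_params_)
```

Because .score() is called here on the same data the search was fit on, the result tends to be optimistic. The best_score_ attribute holds the mean cross-validated score of the best combination, which is usually a more conservative estimate.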
dt_gs_training_score = None
dt_gs_testing_score = None
# print("Mean Training Score: {:.4}%".format(dt_gs_training_score * 100))
# print("Mean Testing Score: {:.4}%".format(dt_gs_testing_score * 100))
# print("Best Parameter Combination Found During Grid Search:")
# dt_grid_search.best_params_
Question: What effect, if any, did our parameter tuning have on model performance? Will GridSearchCV always discover a globally optimal set of parameters? Why or why not?
Now that we have some experience with Grid Searching through parameter values for a Decision Tree Classifier, let's try our luck with a more advanced model and tune a Random Forest Classifier.
We'll start by repeating the same process we did for our Decision Tree Classifier, except with a Random Forest Classifier instead.
In the cell below:
- Create a RandomForestClassifier object.
- Use Cross Validation with cv=3 to generate a baseline score for this model type, so that we have something to compare our tuned model performance to.
rf_clf = None
mean_rf_cv_score = None
# print("Mean Cross Validation Score for Random Forest Classifier: {:.4}%".format(mean_rf_cv_score * 100))
Now that we have our baseline score, we'll create a parameter grid specific to our Random Forest Classifier.
Again--in a real-world situation, you will need to decide what parameters to tune, and be very thoughtful about what values to test for each parameter. However, since this is a lab, we have provided the following table in the interest of simplicity. Complete the rf_param_grid dictionary with the following key-value pairs:
Parameter | Values |
---|---|
n_estimators | [10, 30, 100] |
criterion | ['gini', 'entropy'] |
max_depth | [None, 2, 6, 10] |
min_samples_split | [5, 10] |
min_samples_leaf | [3, 6] |
rf_param_grid = {
}
Great! Now that we have our parameter grid, we can grid search through it with our Random Forest.
In the cell below, follow the process we used with Decision Trees above to grid search for the best parameters for our Random Forest Classifier.
When creating your GridSearchCV object, pass in:
- Our Random Forest Classifier
- The parameter grid for our Random Forest Classifier
- cv=3
- Do not pass in return_train_score as we did with our Decision Trees example above. In the interest of runtime, we'll only worry about testing accuracy this time.
NOTE: The runtime on the following cell will be over a minute on most computers.
import time
start = time.time()
rf_grid_search = None
# rf_grid_search.fit(None, None)
# print("Testing Accuracy: {:.4}%".format(rf_grid_search.best_score_ * 100))
# print("Total Runtime for Grid Search on Random Forest Classifier: {:.4} seconds".format(time.time() - start))
# print("")
# print("Optimal Parameters: {}".format(rf_grid_search.best_params_))
Did tuning the hyperparameters of our Random Forest Classifier improve model performance? Is this performance increase significant? Which model did better? If you had to choose, which model would you put into production? Explain your answer.
Write your answer below this line:
The last model we'll tune in this lab is an AdaBoost Classifier, although tuning this model will generally be similar to tuning other boosting models, such as Gradient Boosted Trees (GBT).
In the cell below, create an AdaBoost Classifier object. Then, as we did with the previous two examples, fit the model using Cross Validation to get a baseline testing accuracy so we can see how an untuned AdaBoost model performs on this task.
adaboost_clf = None
adaboost_mean_cv_score = None
# print("Mean Cross Validation Score for AdaBoost: {:.4}%".format(adaboost_mean_cv_score * 100))
Great! Now, onto creating the parameter grid for AdaBoost.
Complete the adaboost_param_grid dictionary by adding in the following key-value pairs:
Parameters | Values |
---|---|
n_estimators | [50, 100, 250] |
learning_rate | [1.0, 0.5, 0.1] |
adaboost_param_grid = {
}
Great. Now, for the finale--use Grid Search to find optimal parameters for AdaBoost, and see how the model performs overall!
# Reset the timer so the reported runtime covers only the AdaBoost grid search
start = time.time()
adaboost_grid_search = None
# adaboost_grid_search.fit(None, None)
# print("Testing Accuracy: {:.4}%".format(adaboost_grid_search.best_score_ * 100))
# print("Total Runtime for Grid Search on AdaBoost: {:.4} seconds".format(time.time() - start))
# print("")
# print("Optimal Parameters: {}".format(adaboost_grid_search.best_params_))
In this lab, we learned:
- How to iteratively search for optimal model parameters using GridSearchCV
- How to tune model parameters for Decision Trees, Random Forests, and AdaBoost models.