GridSearchCV - Lab

Introduction

In this lab, we'll explore how to use scikit-learn's GridSearchCV class to exhaustively search through every combination of hyperparameters until we find the best values for a given model.

Objectives

You will be able to:

Understand and explain parameter tuning and why it is necessary
Design and create a parameter grid for use with sklearn's GridSearchCV class
Use GridSearchCV to increase model performance through parameter tuning

The Dataset

For this lab, we'll be working with the Wine Quality Dataset from the UCI Machine Learning Dataset Repository. We'll be using data about the various features of wine to predict the quality of the wine on a scale from 0 to 10, making this a multiclass classification problem.

Getting Started

Before we can begin GridSearching our way to optimal hyperparameters, we'll need to go through the basic steps of modeling. This means that we'll need to:

Import and inspect the dataset (and clean, if necessary)
Split the data into training and testing sets
Build and fit a baseline model that we can compare against our GridSearch results.

Run the cell below to import everything we'll need for this lab.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score

Now that we've imported all the necessary libraries and frameworks for this lab, we'll need to get the dataset.

Our data is stored in the file winequality-red.csv. Use pandas to import the data from this file and store it in a DataFrame. Print the head to ensure that everything loaded correctly.

df = None

Great! Let's inspect our data a bit. In the cell below, perform some basic Exploratory Data Analysis on our dataset. Get a feel for your data by exploring the descriptive statistics and creating at least 1 visualization to help you better understand this dataset.

Question: Based on your findings during your Exploratory Data Analysis, do you think that we need to do any sort of preprocessing on this dataset? Why or why not?

Write your answer below this line:

Preprocessing our Data

Now, we'll perform any necessary preprocessing on our dataset before training our model. We'll start by isolating the target variable that we are trying to predict. In the cell below:

Store the data in the quality column inside the labels variable
Drop the quality column from the dataset

labels = None
labels_removed_df = None
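As a rough sketch of the pattern described above, the snippet below separates a target column from a tiny toy DataFrame. The rows and the names toy_df, toy_labels, and toy_features are made up for illustration; they are not taken from the wine dataset.

```python
import pandas as pd

# Toy rows for illustration only; these are not real wine measurements.
toy_df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 6, 7],
})

# Keep the target column as a Series of labels.
toy_labels = toy_df["quality"]

# drop() returns a new DataFrame; toy_df itself is left unchanged.
toy_features = toy_df.drop("quality", axis=1)

print(toy_features.columns.tolist())
```

Because drop() does not modify the original DataFrame by default, the result needs to be stored in a new variable.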

Now that we've isolated our labels, we'll need to normalize our dataset (also referred to as scaling).

In the cell below:

Create a StandardScaler() object.
Transform the data in labels_removed_df using the scaler object's fit_transform() method.

scaler = None
scaled_df = None
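To see what the scaler does, here is a minimal sketch on a toy array (the values and the names toy_X, toy_scaler, and toy_scaled are assumptions for illustration). After fit_transform(), each column should have a mean of roughly 0 and a standard deviation of roughly 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with two columns on very different scales (illustration only).
toy_X = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

toy_scaler = StandardScaler()

# fit_transform() learns each column's mean and standard deviation, then rescales.
toy_scaled = toy_scaler.fit_transform(toy_X)

print(toy_scaled.mean(axis=0))  # per-column means after scaling
print(toy_scaled.std(axis=0))   # per-column standard deviations after scaling
```

Note that by default fit_transform() returns a NumPy array rather than a DataFrame, so the column names are not carried along.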

Training, Testing, and Cross Validation

Normally, we would also split our data into training and testing sets. However, since we'll be making use of Cross Validation when using GridSearchCV, we'll also want to make use of it with our baseline model to ensure that things are equal. Recall that we do not need to split our data into training and testing sets when using cross validation, since the cross validation will take care of that for us.

Creating a Baseline Model: Decision Trees

In the cell below:

Create a DecisionTreeClassifier object.
Get the cross_val_score for this model, with the cv parameter set to 3.
Calculate and print the mean cross-validation score from our model.

Note: If you need a refresher on how to use cross_val_score, check out the documentation.

dt_clf = None
dt_cv_score = None
mean_dt_cv_score = None

# print("Mean Cross Validation Score: {:.4}%".format(mean_dt_cv_score * 100))
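If the mechanics of cross_val_score are unfamiliar, the sketch below shows the general shape of the call on synthetic data from make_classification. The data and the demo_ names are stand-ins assumed for illustration; in the lab you would pass in the scaled wine features and labels instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); not the wine dataset.
demo_X, demo_y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

demo_tree = DecisionTreeClassifier(random_state=0)

# cv=3 yields one accuracy score per fold, so we get an array of 3 values.
demo_scores = cross_val_score(demo_tree, demo_X, demo_y, cv=3)
demo_mean_score = demo_scores.mean()

print("Mean Cross Validation Score: {:.4}%".format(demo_mean_score * 100))
```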

Grid Search: Decision Trees

Take a second to interpret the results of our cross-validation score. How well did our model do? How does this compare to a naive baseline level of accuracy (random guessing)?
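One way to put a number on a naive baseline is sketched below, using a short toy list of labels (made up for illustration, not the real quality column). It computes the accuracy of always predicting the most common class, and the expected accuracy of guessing uniformly among the classes that appear.

```python
from collections import Counter

# Toy labels for illustration only; substitute the real quality labels.
toy_quality = [5, 5, 6, 5, 6, 7, 5, 6]

counts = Counter(toy_quality)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of always predicting the most frequent class.
majority_accuracy = majority_count / len(toy_quality)

# Expected accuracy of guessing uniformly at random among the observed classes.
uniform_accuracy = 1 / len(counts)

print(majority_class, majority_accuracy, uniform_accuracy)
```

When classes are imbalanced, the majority-class figure is usually the tougher baseline of the two to beat.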

Write your answer below this line:

Creating A Parameter Grid

So far, our model has not had stellar performance. However, we've yet to modify the hyperparameters of the model. Each dataset is different, and the chances that the best possible parameters for a given dataset also happen to be the default parameters set by sklearn at instantiation are very low.

This means that we need to try Hyperparameter Tuning. There are several strategies for searching for optimal hyperparameters--the one we'll be using, Combinatoric Grid Searching, is probably the most popular, because it performs an exhaustive search of every combination of the candidate values we supply.

The sklearn module we'll be using to accomplish this is GridSearchCV, which can be found inside of sklearn.model_selection.

Take a minute to look at sklearn's user guide for GridSearchCV, and then complete the following task.

In the cell below:

Complete the param_grid dictionary. In this dictionary, each key represents a parameter we want to tune, whereas the corresponding value is an array of every parameter value we'd like to check for that parameter. For instance, if we would like to try out the values 2, 5, and 10 for min_samples_split, our param_grid dictionary would include "min_samples_split": [2, 5, 10].
Normally, you would have to just try different values to search through for each parameter. However, in order to limit the complexity of this lab, the parameters and values to search through have been provided for you. You just need to turn them into key-value pairs inside of the param_grid dictionary. Complete param_grid so that it tests the following values for each corresponding parameter:
- For "criterion", try values of "gini" and "entropy".
- For "max_depth", try None, as well as 2, 3, 4, 5 and 6.
- For "min_samples_split", try 2, 5, and 10.
- For "min_samples_leaf", try 1, 2, 3, 4, 5 and 6.

dt_param_grid = {
 
}
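To get a feel for how a parameter grid expands into individual combinations, the sketch below uses sklearn's ParameterGrid helper on a hypothetical two-parameter grid that is deliberately smaller than the one in this lab. The names example_grid and combos are assumptions for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical two-parameter grid, smaller than the one used in this lab.
example_grid = {
    "criterion": ["gini", "entropy"],
    "min_samples_split": [2, 5, 10],
}

# ParameterGrid yields one dictionary per unique combination of values.
combos = list(ParameterGrid(example_grid))

for combo in combos:
    print(combo)

print(len(combos))  # 2 criteria * 3 split values = 6 combinations
```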

Now that we have our parameter grid set up, we can create and use our GridSearchCV object. Before we do, let's briefly think about the particulars of this model.

Grid Searching works by training a model on the data for each unique combination of parameters, and then returning the parameters of the model that performed best. In order to protect us from randomness, it is common to implement K-Fold Cross Validation during this step. For this lab, we'll set K = 3, meaning that we'll actually train 3 different models for each unique combination of parameters.
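The arithmetic can be sketched with a small helper. The function count_cv_fits and the grid small_grid are hypothetical and exist only for this illustration: the number of combinations is the product of the lengths of the value lists, and each combination is trained once per fold.

```python
from math import prod

def count_cv_fits(param_grid, cv):
    """Return the number of model fits needed to cross-validate every combination."""
    n_combinations = prod(len(values) for values in param_grid.values())
    return n_combinations * cv

# Hypothetical grid for illustration; not the grid used in this lab.
small_grid = {"max_depth": [None, 2, 4], "min_samples_leaf": [1, 2]}

small_fit_count = count_cv_fits(small_grid, cv=3)
print(small_fit_count)  # 3 * 2 combinations * 3 folds = 18
```

With the default refit=True, GridSearchCV also fits one additional model on the full dataset using the best combination, which this count does not include.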

Given our param_grid and the knowledge that we're going to use Cross Validation with a value of 3, how many different Decision Trees will our GridSearchCV object have to train in order to try every possible combination and find the best parameter choices?

Calculate and print your answer in the cell below.

num_decision_trees = None
print("Grid Search will have to search through {} different permutations.".format(num_decision_trees))

Grid Search will have to search through None different permutations.

That's a lot of Decision Trees! Decision Trees are generally pretty quick to train, but that isn't the case with every type of model we could want to tune. Be aware that if you set a particularly large search space of parameters inside your parameter grid, then Grid Searching could potentially take a very long time.

Let's create our GridSearchCV object and fit it. In the cell below:

Create a GridSearchCV object. Pass in our model, the parameter grid, and cv=3 to tell the object to use 3-Fold Cross Validation. Also pass in return_train_score=True so that the training scores are recorded in cv_results_.
Call our grid search object's fit() method and pass in our data and labels, just as if we were using regular cross validation.

dt_grid_search = None
#dt_grid_search.fit(None, None)
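For reference, here is a minimal sketch of creating and fitting a GridSearchCV object. It uses synthetic data and a deliberately tiny grid, both assumed for illustration (as are the demo_ names), so it is not the solution to the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); not the wine dataset.
demo_X, demo_y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Tiny illustrative grid: 2 * 2 = 4 combinations.
demo_grid = {"max_depth": [2, 4], "min_samples_split": [2, 10]}

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    demo_grid,
    cv=3,
    return_train_score=True,  # needed for mean_train_score to appear in cv_results_
)
demo_search.fit(demo_X, demo_y)

print(demo_search.best_params_)
```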

Examining the Best Parameters

Now that we have fit our model using Grid Search, we need to inspect it to discover the optimal combination of parameters.

In the cell below:

Calculate the mean training score. An array of training score results can be found inside the .cv_results_ dictionary, with the key mean_train_score.
Calculate the testing score using our grid search model's .score() method by passing in our data and labels.
Examine the appropriate attribute to discover the best estimator parameters found during the grid search.

HINT: If you're unsure what attribute this is stored in, take a look at sklearn's GridSearchCV Documentation.

dt_gs_training_score = None
dt_gs_testing_score = None

# print("Mean Training Score: {:.4}%".format(dt_gs_training_score * 100))
# print("Mean Testing Score: {:.4}%".format(dt_gs_testing_score * 100))
# print("Best Parameter Combination Found During Grid Search:")
# dt_grid_search.best_params_
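A fitted grid search exposes several views of its results. The sketch below, again on synthetic stand-in data with assumed demo names, averages mean_train_score, reads best_score_ (the mean cross-validated score of the best combination), and loads cv_results_ into a DataFrame sorted by rank. Keep in mind that calling .score() on the same data used for fitting is not a held-out estimate, so best_score_ is often the more conservative number to compare against.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); not the wine dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, None]},
    cv=3,
    return_train_score=True,
)
search.fit(X_demo, y_demo)

# Average of the per-combination mean training scores.
mean_train = np.mean(search.cv_results_["mean_train_score"])

# Mean cross-validated score of the best combination.
best_cv = search.best_score_

# One row per combination, best-ranked first.
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_train_score", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print("Mean Training Score: {:.4}%".format(mean_train * 100))
print("Best Cross Validation Score: {:.4}%".format(best_cv * 100))
print(results)
```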

Question: What effect, if any, did our parameter tuning have on model performance? Will GridSearchCV always discover a perfectly (global) optimal set of parameters? Why or why not?

Tuning More Advanced Models: Random Forests

Now that we have some experience with Grid Searching through parameter values for a Decision Tree Classifier, let's try our luck with a more advanced model and tune a Random Forest Classifier.

We'll start by repeating the same process we did for our Decision Tree Classifier, except with a Random Forest Classifier instead.

In the cell below:

Create a RandomForestClassifier object.
Use Cross Validation with cv=3 to generate a baseline score for this model type, so that we have something to compare our tuned model performance to.

rf_clf = None
mean_rf_cv_score = None
# print("Mean Cross Validation Score for Random Forest Classifier: {:.4}%".format(mean_rf_cv_score * 100))

Now that we have our baseline score, we'll create a parameter grid specific to our Random Forest Classifier.

Again--in a real world situation, you will need to decide what parameters to tune, and be very thoughtful about what values to test for each parameter. However, since this is a lab, we have provided the following table in the interest of simplicity. Complete the rf_param_grid dictionary with the following key value pairs:

Parameter	Values
n_estimators	[10, 30, 100]
criterion	['gini', 'entropy']
max_depth	[None, 2, 6, 10]
min_samples_split	[5, 10]
min_samples_leaf	[3, 6]

rf_param_grid = {
}

Great! Now that we have our parameter grid, we can grid search through it with our Random Forest.

In the cell below, follow the process we used with Decision Trees above to grid search for the best parameters for our Random Forest Classifier.

When creating your GridSearchCV object, pass in:

Our Random Forest Classifier
The parameter grid for our Random Forest Classifier
cv=3
Do not pass in return_train_score as we did with our Decision Trees example above. In the interest of runtime, we'll only worry about testing accuracy this time.

NOTE: The runtime on the following cell will be over a minute on most computers.

import time
start = time.time()
rf_grid_search = None
# rf_grid_search.fit(None, None)

# print("Testing Accuracy: {:.4}%".format(rf_grid_search.best_score_ * 100))
# print("Total Runtime for Grid Search on Random Forest Classifier: {:.4} seconds".format(time.time() - start))
# print("")
# print("Optimal Parameters: {}".format(rf_grid_search.best_params_))
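As a sketch of timing a grid search, the snippet below wraps the fit() call for a small Random Forest grid on synthetic data. The grid, the data, and the names rf_demo_grid, rf_demo_search, and elapsed_seconds are assumptions for illustration and are much smaller than what this lab asks for.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); not the wine dataset.
rf_X, rf_y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Small illustrative grid: 2 * 2 = 4 combinations, 12 cross-validation fits at cv=3.
rf_demo_grid = {"n_estimators": [5, 10], "max_depth": [2, 6]}

rf_demo_search = GridSearchCV(RandomForestClassifier(random_state=0), rf_demo_grid, cv=3)

# Start the timer immediately before fitting so only the search is measured.
start_time = time.perf_counter()
rf_demo_search.fit(rf_X, rf_y)
elapsed_seconds = time.perf_counter() - start_time

print("Testing Accuracy: {:.4}%".format(rf_demo_search.best_score_ * 100))
print("Runtime: {:.4} seconds".format(elapsed_seconds))
print("Optimal Parameters: {}".format(rf_demo_search.best_params_))
```

Runtime grows with both the number of combinations and n_estimators, which is why the full grid in this lab takes noticeably longer than the Decision Tree search.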

Interpreting Our Results

Did tuning the hyperparameters of our Random Forest Classifier improve model performance? Is this performance increase significant? Which model did better? If you had to choose, which model would you put into production? Explain your answer.

Write your answer below this line:

Tuning Gradient Boosted Trees (AdaBoost)

The last model we'll tune in this lab is an AdaBoost Classifier. AdaBoost is an adaptive boosting algorithm rather than a gradient boosting one, but tuning it is generally similar to tuning Gradient Boosted Tree (GBT) models.

In the cell below, create an AdaBoost Classifier Object. Then, as we did with the previous two examples, fit the model using Cross Validation to get a baseline testing accuracy so we can see how an untuned AdaBoost model performs on this task.

adaboost_clf = None
adaboost_mean_cv_score = None

# print("Mean Cross Validation Score for AdaBoost: {:.4}%".format(adaboost_mean_cv_score * 100))

Great! Now, onto creating the parameter grid for AdaBoost.

Complete the adaboost_param_grid dictionary by adding in the following key-value pairs:

Parameters	Values
n_estimators	[50, 100, 250]
learning_rate	[1.0, 0.5, 0.1]

adaboost_param_grid = {
    
}

Great. Now, for the finale--use Grid Search to find optimal parameters for AdaBoost, and see how the model performs overall!

start = time.time()  # reset the timer so the runtime below covers only this search
adaboost_grid_search = None
# adaboost_grid_search.fit(None, None)

# print("Testing Accuracy: {:.4}%".format(adaboost_grid_search.best_score_ * 100))
# print("Total Runtime for Grid Search on AdaBoost: {:.4} seconds".format(time.time() - start))
# print("")
# print("Optimal Parameters: {}".format(adaboost_grid_search.best_params_))
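To compare an untuned model against a tuned one, the sketch below computes a baseline cross-validation score for AdaBoost and then a grid search score on the same synthetic stand-in data. The data, the reduced grid, and the ada_ names are assumptions for illustration. Because the grid includes the default values (n_estimators=50, learning_rate=1.0) and both steps use the same folds and random_state, the tuned score here should be at least as high as the baseline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data (assumption); not the wine dataset.
ada_X, ada_y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Baseline: default hyperparameters, 3-fold cross validation.
ada_baseline = cross_val_score(
    AdaBoostClassifier(random_state=0), ada_X, ada_y, cv=3
).mean()

# Reduced illustrative grid that contains the default values.
ada_demo_grid = {"n_estimators": [25, 50], "learning_rate": [0.5, 1.0]}

ada_search = GridSearchCV(AdaBoostClassifier(random_state=0), ada_demo_grid, cv=3)
ada_search.fit(ada_X, ada_y)

print("Baseline: {:.4}%".format(ada_baseline * 100))
print("Tuned:    {:.4}%".format(ada_search.best_score_ * 100))
print("Optimal Parameters: {}".format(ada_search.best_params_))
```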

Summary

In this lab, we learned:

How to iteratively search for optimal model parameters using GridSearchCV
How to tune model parameters for Decision Trees, Random Forests, and AdaBoost models.
