Tree Ensembles and Random Forests - Lab

Introduction

In this lab, we'll create some popular tree ensemble models such as a bag of trees and random forest to predict a person's salary based on information about them.

Objectives

In this lab you will:

Train a random forest model using scikit-learn
Access, visualize, and interpret feature importances from an ensemble model

Import data

In this lab, you'll use personal attributes to predict whether people make more than 50k/year. The dataset was extracted from the census bureau database. The goal is to use this dataset to try and draw conclusions regarding what drives salaries. More specifically, the target variable is categorical (> 50k and <= 50 k). Let's create a classification tree!

To get started, run the cell below to import everything we'll need for this lab.

import pandas as pd
import numpy as np
np.random.seed(0)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

Our dataset is stored in the file 'salaries_final.csv'.

In the cell below, import the dataset from this file and store it in a DataFrame. Be sure to set the index_col parameter to 0. Then, display the .head() of the DataFrame to ensure that everything loaded correctly.

# Import the data
salaries = None

In total, there are 6 predictors, and one outcome variable, the salary, Target - <= 50k and >50k.

The 6 predictors are:

Age: continuous
Education: Categorical. Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
Occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
Relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
Race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
Sex: Female, Male

First, we'll need to store our 'Target' column in a separate variable and drop it from the dataset.

Do this in the cell below.

# Split the outcome and predictor variables
target = None

In the cell below, examine the data type of each column:

# Your code here

Great. 'Age' is numeric, as it should be. Now we're ready to create some dummy columns and deal with our categorical variables.

In the cell below, use Pandas to create dummy columns for each of categorical variables. If you're unsure of how to do this, check out the documentation.

# Create dummy variables
data = None
data.head()

Now, split data and target into 75/25 training and test sets. Set the random_state to 123.

data_train, data_test, target_train, target_test = None

Build a "regular" tree as a baseline

We'll begin by fitting a regular decision tree classifier, so that we have something to compare our ensemble methods to.

Build the tree

In the cell below, instantiate and fit a decision tree classifier. Set the criterion to 'gini', and a max_depth of 5. Then, fit the tree to the training data and labels.

# Instantiate and fit a DecisionTreeClassifier
tree_clf = None

Feature importance

Let's quickly examine how important each feature ended up being in our decision tree model. Check the feature_importances_ attribute of the trained model to see what it displays.

# Feature importance

That matrix isn't very helpful, but a visualization of the data it contains could be. Run the cell below to plot a visualization of the feature importances for this model.

def plot_feature_importances(model):
    n_features = data_train.shape[1]
    plt.figure(figsize=(8,8))
    plt.barh(range(n_features), model.feature_importances_, align='center') 
    plt.yticks(np.arange(n_features), data_train.columns.values) 
    plt.xlabel('Feature importance')
    plt.ylabel('Feature')

plot_feature_importances(tree_clf)

Model performance

Next, let's see how well our model performed on the test data.

In the cell below:

Use the model to generate predictions on the test set
Print out a confusion_matrix of the test set predictions
Print out a classification_report of the test set predictions

# Test set predictions
pred = None

# Confusion matrix and classification report

Now, let's check the model's accuracy. Run the cell below to display the test set accuracy of the model.

print("Testing Accuracy for Decision Tree Classifier: {:.4}%".format(accuracy_score(target_test, pred) * 100))

Bagged trees

The first ensemble approach we'll try is a bag of trees. This will make use of Bagging, along with a number of decision tree classifier models.

Now, let's instantiate a BaggingClassifier. First, initialize a DecisionTreeClassifier and set the same parameters that we did above for criterion and max_depth. Also set the n_estimators parameter for our BaggingClassifier to 20.

# Instantiate a BaggingClassifier
bagged_tree = None

Great! Now, fit it to our training data.

# Fit to the training data

Checking the accuracy of a model is such a common task that all (supervised learning) models have a .score() method that wraps the accuracy_score() helper function we've been using. All we have to do is pass it a dataset and the corresponding labels and it will return the accuracy score for those data/labels.

Let's use it to get the training accuracy of our model. In the cell below, call the .score() method on our bagging model and pass in our training data and training labels as parameters.

# Training accuracy score

Now, let's check the accuracy score that really matters -- our testing accuracy. This time, pass in our testing data and labels to see how the model did.

# Test accuracy score

Random forests

Another popular ensemble method is the Random Forest. Let's fit a random forest classifier next and see how it measures up compared to all the others.

Fit a random forests model

In the cell below, instantiate and fit a RandomForestClassifier, and set the number estimators to 100 and the max depth to 5. Then, fit the model to our training data.

# Instantiate and fit a RandomForestClassifier
forest = None

Now, let's check the training and testing accuracy of the model using its .score() method:

# Training accuracy score

# Test accuracy score

Feature importance

plot_feature_importances(forest)

Note: "relationship" represents what this individual is relative to others. For example an individual could be a Husband. Each entry only has one relationship, so it is a bit of a weird attribute.

Also note that more features show up. This is a pretty typical result.

Look at the trees in your forest

Let's create a forest with some small trees. You'll learn how to access trees in your forest!

In the cell below, create another RandomForestClassifier. Set the number of estimators to 5, the max_features to 10, and the max_depth to 2.

# Instantiate and fit a RandomForestClassifier
forest_2 = None

Making max_features smaller will lead to very different trees in your forest! The trees in your forest are stored in the .estimators_ attribute.

In the cell below, get the first tree from forest_2.estimators_ and store it in rf_tree_1

# First tree from forest_2
rf_tree_1 = None

Now, we can reuse our plot_feature_importances() function to visualize which features this tree was given to use duing subspace sampling.

In the cell below, call plot_feature_importances() on rf_tree_1.

# Feature importance

Now, grab the second tree and store it in rf_tree_2, and then pass it to plot_feature_importances() in the following cell so we can compare which features were most useful to each.

# Second tree from forest_2
rf_tree_2 = None

# Feature importance

We can see by comparing the two plots that the two trees we examined from our random forest look at different attributes, and have wildly different feature importances!

Summary

In this lab, we got some practice creating a few different tree ensemble methods. We also learned how to visualize feature importances, and compared individual trees from a random forest to see if we could notice the differences in the features they were trained on.

srobz/dsc-tree-ensembles-random-forests-lab-online-ds-sp-000

Tree Ensembles and Random Forests - Lab

Introduction

Objectives

Import data

Build a "regular" tree as a baseline

Build the tree

Feature importance

Model performance

Bagged trees

Random forests

Fit a random forests model

Feature importance

Look at the trees in your forest

Summary