In this lab, we'll create some popular tree ensemble models such as a bag of trees and random forest to predict a person's salary based on information about them.
In this lab you will:
- Train a random forest model using
scikit-learn
- Access, visualize, and interpret feature importances from an ensemble model
In this lab, you'll use personal attributes to predict whether people make more than 50k/year. The dataset was extracted from the census bureau database. The goal is to use this dataset to try and draw conclusions regarding what drives salaries. More specifically, the target variable is categorical (> 50k and <= 50 k). Let's create a classification tree!
To get started, run the cell below to import everything we'll need for this lab.
import pandas as pd
import numpy as np
np.random.seed(0)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
Our dataset is stored in the file 'salaries_final.csv'
.
In the cell below, import the dataset from this file and store it in a DataFrame. Be sure to set the index_col
parameter to 0
. Then, display the .head()
of the DataFrame to ensure that everything loaded correctly.
# Import the data
salaries = None
In total, there are 6 predictors, and one outcome variable, the salary, Target
- <= 50k
and >50k
.
The 6 predictors are:
-
Age
: continuous -
Education
: Categorical. Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool -
Occupation
: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces -
Relationship
: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried -
Race
: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black -
Sex
: Female, Male
First, we'll need to store our 'Target'
column in a separate variable and drop it from the dataset.
Do this in the cell below.
# Split the outcome and predictor variables
target = None
In the cell below, examine the data type of each column:
# Your code here
Great. 'Age'
is numeric, as it should be. Now we're ready to create some dummy columns and deal with our categorical variables.
In the cell below, use Pandas to create dummy columns for each of categorical variables. If you're unsure of how to do this, check out the documentation.
# Create dummy variables
data = None
data.head()
Now, split data
and target
into 75/25 training and test sets. Set the random_state
to 123.
data_train, data_test, target_train, target_test = None
We'll begin by fitting a regular decision tree classifier, so that we have something to compare our ensemble methods to.
In the cell below, instantiate and fit a decision tree classifier. Set the criterion
to 'gini'
, and a max_depth
of 5
. Then, fit the tree to the training data and labels.
# Instantiate and fit a DecisionTreeClassifier
tree_clf = None
Let's quickly examine how important each feature ended up being in our decision tree model. Check the feature_importances_
attribute of the trained model to see what it displays.
# Feature importance
That matrix isn't very helpful, but a visualization of the data it contains could be. Run the cell below to plot a visualization of the feature importances for this model.
def plot_feature_importances(model):
n_features = data_train.shape[1]
plt.figure(figsize=(8,8))
plt.barh(range(n_features), model.feature_importances_, align='center')
plt.yticks(np.arange(n_features), data_train.columns.values)
plt.xlabel('Feature importance')
plt.ylabel('Feature')
plot_feature_importances(tree_clf)
Next, let's see how well our model performed on the test data.
In the cell below:
- Use the model to generate predictions on the test set
- Print out a
confusion_matrix
of the test set predictions - Print out a
classification_report
of the test set predictions
# Test set predictions
pred = None
# Confusion matrix and classification report
Now, let's check the model's accuracy. Run the cell below to display the test set accuracy of the model.
print("Testing Accuracy for Decision Tree Classifier: {:.4}%".format(accuracy_score(target_test, pred) * 100))
The first ensemble approach we'll try is a bag of trees. This will make use of Bagging, along with a number of decision tree classifier models.
Now, let's instantiate a BaggingClassifier
. First, initialize a DecisionTreeClassifier
and set the same parameters that we did above for criterion
and max_depth
. Also set the n_estimators
parameter for our BaggingClassifier
to 20
.
# Instantiate a BaggingClassifier
bagged_tree = None
Great! Now, fit it to our training data.
# Fit to the training data
Checking the accuracy of a model is such a common task that all (supervised learning) models have a .score()
method that wraps the accuracy_score()
helper function we've been using. All we have to do is pass it a dataset and the corresponding labels and it will return the accuracy score for those data/labels.
Let's use it to get the training accuracy of our model. In the cell below, call the .score()
method on our bagging model and pass in our training data and training labels as parameters.
# Training accuracy score
Now, let's check the accuracy score that really matters -- our testing accuracy. This time, pass in our testing data and labels to see how the model did.
# Test accuracy score
Another popular ensemble method is the Random Forest. Let's fit a random forest classifier next and see how it measures up compared to all the others.
In the cell below, instantiate and fit a RandomForestClassifier
, and set the number estimators to 100
and the max depth to 5
. Then, fit the model to our training data.
# Instantiate and fit a RandomForestClassifier
forest = None
Now, let's check the training and testing accuracy of the model using its .score()
method:
# Training accuracy score
# Test accuracy score
plot_feature_importances(forest)
Note: "relationship" represents what this individual is relative to others. For example an individual could be a Husband. Each entry only has one relationship, so it is a bit of a weird attribute.
Also note that more features show up. This is a pretty typical result.
Let's create a forest with some small trees. You'll learn how to access trees in your forest!
In the cell below, create another RandomForestClassifier
. Set the number of estimators to 5, the max_features
to 10, and the max_depth
to 2.
# Instantiate and fit a RandomForestClassifier
forest_2 = None
Making max_features
smaller will lead to very different trees in your forest! The trees in your forest are stored in the .estimators_
attribute.
In the cell below, get the first tree from forest_2.estimators_
and store it in rf_tree_1
# First tree from forest_2
rf_tree_1 = None
Now, we can reuse our plot_feature_importances()
function to visualize which features this tree was given to use duing subspace sampling.
In the cell below, call plot_feature_importances()
on rf_tree_1
.
# Feature importance
Now, grab the second tree and store it in rf_tree_2
, and then pass it to plot_feature_importances()
in the following cell so we can compare which features were most useful to each.
# Second tree from forest_2
rf_tree_2 = None
# Feature importance
We can see by comparing the two plots that the two trees we examined from our random forest look at different attributes, and have wildly different feature importances!
In this lab, we got some practice creating a few different tree ensemble methods. We also learned how to visualize feature importances, and compared individual trees from a random forest to see if we could notice the differences in the features they were trained on.