In this lab, we'll create some popular Tree Ensemble models such as a Bag of Trees and a Random Forest to predict a person's salary based on information about them.
You will be able to:
- Create, train, and make predictions with Bagging Classifiers
- Create, train, and make predictions with a Random Forest
- Understand and explain the concept of bagging as it applies to Ensemble Methods
- Understand and explain the Subspace Sampling Method and its use in Random Forests
In this lab, we'll be looking at a dataset of information about people and trying to predict whether they make more than 50k/year. The salary dataset was extracted from the Census Bureau database. The goal is to use this dataset to draw conclusions about what drives salaries. More specifically, the target variable is categorical (>50k; <=50k). We'll start with a single classification tree as a baseline, and then build our ensembles!
To get started, run the cell below to import everything we'll need for this lab.
import pandas as pd
import numpy as np
np.random.seed(0)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
Our dataset is stored in the file salaries_final.csv.
In the cell below, read in the dataset from this file and store it in a DataFrame. Be sure to set the index_col parameter to 0. Then, display the head of the DataFrame to ensure that everything loaded correctly.
salaries = None
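If you get stuck, a minimal sketch (assuming salaries_final.csv sits in the same folder as this notebook) could look like:
salaries = pd.read_csv('salaries_final.csv', index_col=0)  # index_col=0 uses the first column as the index
salaries.head()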
In total, there are 6 predictors and one outcome variable, the target salary (<=50k / >50k).
Recall that the 6 predictors are:
- Age: continuous.
- Education: categorical. Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- Occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- Relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- Race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- Sex: Female, Male.
First, we'll need to store our 'Target' column in a separate variable and drop it from the dataset. Do this in the cell below.
target = None
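One possible approach, given that the outcome column is named 'Target':
target = salaries['Target']                 # hold on to the labels
salaries = salaries.drop('Target', axis=1)  # remove them from the predictors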
Next, we'll want to confirm that the Age column is encoded as a numeric data type, and not as a string. When we create dummy variables in a later step, pandas will treat every string-encoded column as categorical and create a dummy column for each unique value it contains. We don't want a separate column for every individual age, so let's double-check that the Age column is stored as an integer or a float.
In the cell below, check the .dtypes attribute of the DataFrame to examine the data type of each column.
# Your code here
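Checking is a one-liner:
salaries.dtypes  # Age should appear as int64 (or float64), the other columns as object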
Great. Now we're ready to create some dummy columns and deal with our categorical variables.
In the cell below, use pandas to create dummy columns for each of the categorical variables. If you're unsure of how to do this, check out the documentation.
data = None
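A sketch using pd.get_dummies, which dummy-encodes the remaining string columns and leaves numeric columns like Age untouched:
data = pd.get_dummies(salaries)
data.head()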
Now, split your data and target into training and testing sets using the appropriate method from sklearn.
data_train, data_test, target_train, target_test = None
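One reasonable way to do this (the test_size and random_state values here are illustrative choices, not requirements):
data_train, data_test, target_train, target_test = train_test_split(data, target, test_size=0.25, random_state=42)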
We'll begin by fitting a regular Decision Tree Classifier, so that we have something to compare our ensemble methods to.
In the cell below, create a Decision Tree Classifier. Set the criterion to 'gini' and the max_depth to 5. Then, fit the tree to our training data and labels.
tree_clf = None
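A sketch with the parameters described above:
tree_clf = DecisionTreeClassifier(criterion='gini', max_depth=5)
tree_clf.fit(data_train, target_train)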
Let's quickly examine how important each feature ended up being in our Decision Tree model. Check the feature_importances_ attribute of our trained model to see what it displays.
# Your code here
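For example:
tree_clf.feature_importances_  # one importance value per column in data_train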
That array isn't very helpful on its own, but a visualization of the data it contains could be. Run the cell below to create a visualization of the data stored inside a model's .feature_importances_ attribute.
def plot_feature_importances(model):
    # Plot a horizontal bar for each feature's importance, labeled with the column names
    n_features = data_train.shape[1]
    plt.figure(figsize=(8, 8))
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), data_train.columns.values)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
plot_feature_importances(tree_clf)
Next, let's see how well our model performed on the data.
In the cell below:
- Use the classifier to create predictions on our test set.
- Print out a confusion_matrix of our test set predictions.
- Print out a classification_report of our test set predictions.
pred = None
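If you need a hint, these steps could look like:
pred = tree_clf.predict(data_test)
print(confusion_matrix(target_test, pred))
print(classification_report(target_test, pred))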
Now, let's check the model's accuracy. Run the cell below to display the test set accuracy of the model.
print("Testing Accuracy for Decision Tree Classifier: {:.4}%".format(accuracy_score(target_test, pred) * 100))
The first Ensemble approach we'll try is a Bag of Trees. This will make use of Bagging, along with a number of Decision Tree Classifier models.
Now, let's create a BaggingClassifier. In the first parameter spot, initialize a DecisionTreeClassifier and set the same parameters that we did above for criterion and max_depth. Also set the n_estimators parameter for our Bagging Classifier to 20.
bagged_tree = None
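One way to set this up, passing the base tree in the first parameter spot:
bagged_tree = BaggingClassifier(DecisionTreeClassifier(criterion='gini', max_depth=5), n_estimators=20)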
Great! Now, fit it to our training data.
# Your code here
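That is:
bagged_tree.fit(data_train, target_train)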
Checking the accuracy of a model is such a common task that all (supervised learning) models contain a score() method that wraps the accuracy_score helper function we've been using. All we have to do is pass it a dataset and the corresponding labels, and it will return the accuracy score for that data.
Let's use it to get the training accuracy of our model. In the cell below, call the .score() method on our Bagging model and pass in our training data and training labels as parameters.
# Your code here
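For instance:
bagged_tree.score(data_train, target_train)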
Now, let's check the accuracy score that really matters: our testing accuracy. This time, pass in our testing data and labels to see how the model did.
# Your code here
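And for the test set:
bagged_tree.score(data_test, target_test)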
Another popular ensemble method is the Random Forest model. Let's fit a Random Forest Classifier next and see how it measures up compared to all the others.
In the cell below, create a RandomForestClassifier and set the number of estimators to 100 and the max depth to 5. Then, fit the model to our training data.
forest = None
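A possible version with those settings:
forest = RandomForestClassifier(n_estimators=100, max_depth=5)
forest.fit(data_train, target_train)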
Now, let's check the training and testing accuracy of the model using its .score() method.
# Your code here
# Your code here
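Something along these lines:
print("Training Accuracy:", forest.score(data_train, target_train))
print("Testing Accuracy:", forest.score(data_test, target_test))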
plot_feature_importances(forest)
Note: "relationship" represents what this individual is relative to others. For example, an individual could be a Husband. Each entry only has one relationship, so it is a bit of a weird attribute.
Also note that more features show up with nonzero importance than in the single decision tree. This is a pretty typical result for a Random Forest, since the importances are averaged across many different trees.
Let's create a forest with some small trees. You'll learn how to access trees in your forest!
In the cell below, create another RandomForestClassifier. Set the number of estimators to 5, the max_features to 10, and the max_depth to 2.
forest_2 = None
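One way to do this (note that the forest has to be fit before its individual trees exist):
forest_2 = RandomForestClassifier(n_estimators=5, max_features=10, max_depth=2)
forest_2.fit(data_train, target_train)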
Making max_features smaller will lead to very different trees in your forest!
The trees in your forest are stored in the .estimators_ attribute.
In the cell below, get the first tree from forest_2.estimators_ and store it in rf_tree_1.
rf_tree_1 = None
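For example:
rf_tree_1 = forest_2.estimators_[0]  # estimators_ is a list of the fitted DecisionTreeClassifiers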
Now, we can reuse our plot_feature_importances function to visualize which features this tree was given to use during subspace sampling.
In the cell below, call plot_feature_importances on rf_tree_1.
# Your code here
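In other words:
plot_feature_importances(rf_tree_1)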
Now, grab the second tree and store it in rf_tree_2, and then pass it to plot_feature_importances in the following cell so we can compare which features were most useful to each.
rf_tree_2 = None
# Your code here
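Following the same pattern:
rf_tree_2 = forest_2.estimators_[1]
plot_feature_importances(rf_tree_2)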
We can see by comparing the two plots that the two trees we examined from our Random Forest look at different attributes, and have wildly different importances for them!
In this lab, we got some practice creating a few different Tree Ensemble Methods. We also learned how to visualize feature importances, and compared individual trees from a Random Forest to see if we could notice the differences in the features they were trained on.