In this cumulative lab, you will apply two nonparametric models you have just learned (k-nearest neighbors and decision trees) to the forest cover dataset. In particular, you will:
- Practice identifying and applying appropriate preprocessing steps
- Perform an iterative modeling process, starting from a baseline model
- Explore multiple model algorithms, and tune their hyperparameters
- Practice choosing a final model across multiple model algorithms and evaluating its performance
Recall the dataset description from the previous lab:

Here we will be using an adapted version of the forest cover dataset from the UCI Machine Learning Repository. Each record represents a 30m x 30m cell of land within Roosevelt National Forest in northern Colorado, which has been labeled with a `Cover_Type` of 1 for "Cottonwood/Willow" or 0 for "Ponderosa Pine". (The original dataset contained 7 cover types, but we have simplified it to a binary problem.)

The task is to predict `Cover_Type` based on the available cartographic variables:
# Run this cell without changes
import pandas as pd
df = pd.read_csv('data/forest_cover.csv')
df
As you can see, we have over 38,000 rows, each with 52 feature columns and 1 target column:

- `Elevation`: Elevation in meters
- `Aspect`: Aspect in degrees azimuth
- `Slope`: Slope in degrees
- `Horizontal_Distance_To_Hydrology`: Horizontal distance to nearest surface water features in meters
- `Vertical_Distance_To_Hydrology`: Vertical distance to nearest surface water features in meters
- `Horizontal_Distance_To_Roadways`: Horizontal distance to nearest roadway in meters
- `Hillshade_9am`: Hillshade index at 9am, summer solstice
- `Hillshade_Noon`: Hillshade index at noon, summer solstice
- `Hillshade_3pm`: Hillshade index at 3pm, summer solstice
- `Horizontal_Distance_To_Fire_Points`: Horizontal distance to nearest wildfire ignition points, in meters
- `Wilderness_Area_x`: Wilderness area designation (3 columns)
- `Soil_Type_x`: Soil type designation (39 columns)
- `Cover_Type`: 1 for cottonwood/willow, 0 for ponderosa pine
This is also an imbalanced dataset, since cottonwood/willow trees are relatively rare in this forest:
# Run this cell without changes
print("Raw Counts")
print(df["Cover_Type"].value_counts())
print()
print("Percentages")
print(df["Cover_Type"].value_counts(normalize=True))
This means that a baseline model that always chose the majority class would have an accuracy of over 92%, so we will want to report additional metrics (not just accuracy) at the end.
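To make that baseline concrete, here is a quick check using scikit-learn's `DummyClassifier` (a sketch; it assumes the `df` loaded above):

# A quick sanity check of the majority-class baseline (optional)
from sklearn.dummy import DummyClassifier

X_all = df.drop("Cover_Type", axis=1)
y_all = df["Cover_Type"]

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_all, y_all)
# Accuracy equals the majority class proportion: a bit over 92%
print(dummy.score(X_all, y_all))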
In a previous lab, we used SMOTE to create additional synthetic data, then tuned the hyperparameters of a logistic regression model to get the following final model metrics:
- Log loss: 0.13031294393913376
- Accuracy: 0.9456679825472678
- Precision: 0.6659919028340081
- Recall: 0.47889374090247455
In this lab, you will try to beat those scores using more-complex, nonparametric models.
Although you may be aware of some additional model algorithms available from scikit-learn, for this lab you will be focusing on two of them: k-nearest neighbors and decision trees. Here are some reminders about these models:
kNN ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html))
This algorithm — unlike linear models or tree-based models — does not emphasize learning the relationship between the features and the target. Instead, for a given test record, it finds the most similar records in the training set and returns an average of their target values.
- Training speed: Fast. In theory it's just saving the training data for later, although the scikit-learn implementation has some additional logic "under the hood" to make prediction faster.
- Prediction speed: Very slow. The model has to look at every record in the training set to find the k closest to the new record.
- Requires scaling: Yes. The algorithm for finding the nearest records is distance-based, so it matters that all of the features are on the same scale.
- Key hyperparameters: `n_neighbors` (how many nearest neighbors to find; too few neighbors leads to overfitting, too many leads to underfitting), plus `p` and `metric` (what kind of distance to use in defining the "nearest" neighbors).
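For reference, instantiating a kNN classifier with non-default hyperparameters looks like this (a minimal sketch; these particular values are illustrative, not recommendations):

# Illustrative only: n_neighbors controls the bias/variance tradeoff,
# and p=1 switches from Euclidean distance (p=2, the default) to Manhattan
from sklearn.neighbors import KNeighborsClassifier
knn_example = KNeighborsClassifier(n_neighbors=15, p=1)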
Decision Trees ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html))
Similar to linear models (and unlike kNN), this algorithm emphasizes learning the relationship between the features and the target. However, unlike a linear model that tries to find linear relationships between each of the features and the target, decision trees look for ways to split the data based on features to decrease the impurity of the target within each split.
- Training speed: Slow. The model considers splits based on as many as all of the available features, and it can split on the same feature multiple times, so training time grows quickly with both the number of rows and the number of columns.
- Prediction speed: Medium fast. Producing a prediction with a decision tree means applying several conditional statements, which is slower than something like logistic regression but faster than kNN.
- Requires scaling: No. This model is not distance-based. You can also use a `LabelEncoder` rather than a `OneHotEncoder` for categorical data, since this algorithm doesn't necessarily assume that the distance between 1 and 2 is the same as the distance between 2 and 3.
- Key hyperparameters: Many of them relate to "pruning" the tree. By default they are set so that the tree can overfit, and by adjusting them (higher or lower, depending on the hyperparameter) you can reduce overfitting, although too much pruning will lead to underfitting. These are `max_depth`, `min_samples_split`, `min_samples_leaf`, `min_weight_fraction_leaf`, `max_features`, `max_leaf_nodes`, and `min_impurity_decrease`. You can also try changing the `criterion` to `"entropy"` or the `splitter` to `"random"` if you want to change the splitting logic.
The target is `Cover_Type`. In the cell below, split `df` into `X` and `y`, then perform a train-test split with `random_state=42` and `stratify=y` to create variables with the standard `X_train`, `X_test`, `y_train`, `y_test` names.
Include the relevant imports as you go.
# Your code here
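If you get stuck, one possible approach looks like this (a sketch using the variable names that the checks further down expect):

from sklearn.model_selection import train_test_split

# Separate the features from the target
X = df.drop("Cover_Type", axis=1)
y = df["Cover_Type"]

# Stratify so that both splits keep the same class imbalance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)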
Now, instantiate a `StandardScaler`, fit it on `X_train`, and create new variables `X_train_scaled` and `X_test_scaled` containing values transformed with the scaler.
# Your code here
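One possible approach (a sketch; note that the scaler is fit on the training data only, to avoid leakage):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit on the training data only, then transform both splits
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)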
The following code checks that everything is set up correctly:
# Run this cell without changes
# Checking that df was separated into correct X and y
assert type(X) == pd.DataFrame and X.shape == (38501, 52)
assert type(y) == pd.Series and y.shape == (38501,)
# Checking the train-test split
assert type(X_train) == pd.DataFrame and X_train.shape == (28875, 52)
assert type(X_test) == pd.DataFrame and X_test.shape == (9626, 52)
assert type(y_train) == pd.Series and y_train.shape == (28875,)
assert type(y_test) == pd.Series and y_test.shape == (9626,)
# Checking the scaling
assert X_train_scaled.shape == X_train.shape
assert round(X_train_scaled[0][0], 3) == -0.636
assert X_test_scaled.shape == X_test.shape
assert round(X_test_scaled[0][0], 3) == -1.370
Build a scikit-learn kNN model with default hyperparameters. Then use `cross_val_score` with `scoring="neg_log_loss"` to find the mean log loss for this model (passing `X_train_scaled` and `y_train` in to `cross_val_score`). You'll need to find the mean of the cross-validated scores and negate the value (either put a `-` at the beginning or multiply by `-1`) so that your answer is a log loss rather than a negative log loss.

Call the resulting score `knn_baseline_log_loss`.
Your code might take a minute or more to run.
# Replace None with appropriate code
# Relevant imports
None
# Creating the model
knn_baseline_model = None
# Perform cross-validation
knn_baseline_log_loss = None
knn_baseline_log_loss
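If you need a hint, the general pattern is shown below (a sketch; equivalent approaches are fine):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

knn_baseline_model = KNeighborsClassifier()

# cross_val_score returns one negative log loss per fold;
# average the folds and flip the sign to get a log loss
knn_baseline_log_loss = -cross_val_score(
    knn_baseline_model, X_train_scaled, y_train, scoring="neg_log_loss"
).mean()
knn_baseline_log_loss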
Our best logistic regression model had a log loss of 0.13031294393913376. Is this model better? Compare the two in terms of metrics and speed.
# Replace None with appropriate text
"""
None
"""
Build and evaluate at least two more kNN models to find the best one. Explain why you are changing the hyperparameters you change as you go (a sketch of one starting point appears after the cells below). These models will be slow to run, so think about what you might try next while each one runs.
# Your code here (add more cells as needed)
# Your code here (add more cells as needed)
# Your code here (add more cells as needed)
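For example, you might loop over a few values of `n_neighbors` (a sketch; the candidate values are arbitrary starting points, not recommendations):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Larger k smooths the decision boundary (less variance, more bias)
for k in [5, 15, 25]:
    model = KNeighborsClassifier(n_neighbors=k)
    score = -cross_val_score(
        model, X_train_scaled, y_train, scoring="neg_log_loss"
    ).mean()
    print(f"n_neighbors={k}: log loss {score:.4f}")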
Now that you have chosen your best kNN model, start investigating decision tree models. First, build and evaluate a baseline decision tree model, using default hyperparameters (with the exception of `random_state=42` for reproducibility).
(Use cross-validated log loss, just like with the previous models.)
# Your code here
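The pattern mirrors the kNN baseline (a sketch; note that the unscaled `X_train` works here, since trees are not distance-based):

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

dt_baseline_model = DecisionTreeClassifier(random_state=42)
# Same cross-validated log loss setup as before, minus the scaling
dt_baseline_log_loss = -cross_val_score(
    dt_baseline_model, X_train, y_train, scoring="neg_log_loss"
).mean()
dt_baseline_log_loss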
Interpret this score. How does this compare to the log loss from our best logistic regression and best kNN models? Any guesses about why?
# Replace None with appropriate text
"""
None
"""
Build and evaluate at least two more decision tree models to find the best one. Explain why you are changing the hyperparameters you are changing as you go.
# Your code here (add more cells as needed)
# Your code here (add more cells as needed)
# Your code here (add more cells as needed)
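As with kNN, looping over a single pruning hyperparameter is a reasonable starting point (a sketch; the `max_depth` values are arbitrary):

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# An unpruned tree tends to overfit; limiting depth should help,
# but too small a depth will underfit
for depth in [3, 5, 10, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    score = -cross_val_score(
        model, X_train, y_train, scoring="neg_log_loss"
    ).mean()
    print(f"max_depth={depth}: log loss {score:.4f}")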
Which model had the best performance? What type of model was it?
Instantiate a variable `final_model` using your best model with the best hyperparameters.
# Replace None with appropriate code
final_model = None
# Fit the model on the full training data
# (scaled or unscaled depending on the model)
None
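For example, if your best model were a tuned decision tree, the cell might look like this (hypothetical values; substitute whichever model and hyperparameters actually won):

from sklearn.tree import DecisionTreeClassifier

# Hypothetical example only; use your own best model and hyperparameters
final_model = DecisionTreeClassifier(max_depth=10, random_state=42)

# A decision tree fits on the unscaled data;
# a kNN model would use X_train_scaled instead
final_model.fit(X_train, y_train)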
Now, evaluate the log loss, accuracy, precision, and recall. This code is mostly filled in for you, but you need to replace `None` with either `X_test` or `X_test_scaled`, depending on the model you chose.
# Replace None with appropriate code
from sklearn.metrics import log_loss, accuracy_score, precision_score, recall_score
preds = final_model.predict(None)
probs = final_model.predict_proba(None)
print("log loss: ", log_loss(y_test, probs))
print("accuracy: ", accuracy_score(y_test, preds))
print("precision:", precision_score(y_test, preds))
print("recall: ", recall_score(y_test, preds))
Interpret your model performance. How would it perform on different kinds of tasks? How much better is it than a "dummy" model that always chooses the majority class, or the logistic regression described at the start of the lab?
# Replace None with appropriate text
"""
None
"""
In this lab, you practiced the end-to-end machine learning process with multiple model algorithms, including tuning their hyperparameters. You saw how nonparametric models can be more flexible than linear models: they can overfit, but they can also reduce underfitting by learning non-linear relationships between variables. You also likely saw a tradeoff between speed and performance, with the best-scoring models often being the slowest.