In this cumulative lab, you will walk through a complete machine learning workflow with logistic regression, including data preparation, modeling (including hyperparameter tuning), and final model evaluation.
You will be able to:
- Practice identifying and applying appropriate preprocessing steps
- Perform an iterative modeling process, starting from a baseline model
- Practice model validation
- Practice choosing a final logistic regression model and evaluating its performance
Here we will be using an adapted version of the forest cover dataset from the UCI Machine Learning Repository. Each record represents a 30 x 30 meter cell of land within Roosevelt National Forest in northern Colorado, which has been labeled as Cover_Type 1 for "Cottonwood/Willow" and Cover_Type 0 for "Ponderosa Pine". (The original dataset contained 7 cover types but we have simplified it.)
The task is to predict the Cover_Type based on the available cartographic variables:
# Run this cell without changes
import pandas as pd
df = pd.read_csv('data/forest_cover.csv')
df
As you can see, we have over 38,000 rows, each with 52 feature columns and 1 target column:
- Elevation: Elevation in meters
- Aspect: Aspect in degrees azimuth
- Slope: Slope in degrees
- Horizontal_Distance_To_Hydrology: Horizontal dist to nearest surface water features in meters
- Vertical_Distance_To_Hydrology: Vertical dist to nearest surface water features in meters
- Horizontal_Distance_To_Roadways: Horizontal dist to nearest roadway in meters
- Hillshade_9am: Hillshade index at 9am, summer solstice
- Hillshade_Noon: Hillshade index at noon, summer solstice
- Hillshade_3pm: Hillshade index at 3pm, summer solstice
- Horizontal_Distance_To_Fire_Points: Horizontal dist to nearest wildfire ignition points, meters
- Wilderness_Area_x: Wilderness area designation (3 columns)
- Soil_Type_x: Soil type designation (39 columns)
- Cover_Type: 1 for cottonwood/willow, 0 for ponderosa pine
This is also an imbalanced dataset, since cottonwood/willow trees are relatively rare in this forest:
# Run this cell without changes
print("Raw Counts")
print(df["Cover_Type"].value_counts())
print()
print("Percentages")
print(df["Cover_Type"].value_counts(normalize=True))
If we had a model that always said that the cover type was ponderosa pine (class 0), what accuracy score would we get?
# Replace None with appropriate text
"""
None
"""
You will need to take this into account when working through this problem.
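As a quick, optional illustration of that point, scikit-learn's DummyClassifier with strategy="most_frequent" behaves exactly like the "always predict ponderosa pine" model described above; its accuracy is just the majority-class proportion. (The names X_all, y_all, and dummy_model below are only for this sketch, not variables the lab asks you to create.)
# Optional sketch: a classifier that always predicts the majority class
from sklearn.dummy import DummyClassifier

X_all = df.drop("Cover_Type", axis=1)
y_all = df["Cover_Type"]

dummy_model = DummyClassifier(strategy="most_frequent")
dummy_model.fit(X_all, y_all)

# .score returns accuracy, which here is simply the proportion of class 0 rows
dummy_model.score(X_all, y_all)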
For a complete end-to-end ML process, we need to create a holdout set that we will use at the very end to evaluate our final model's performance.
Without performing any preprocessing or hyperparameter tuning, build and evaluate a vanilla logistic regression model using log loss and cross_val_score.
Because we are using preprocessing techniques that differ for train and validation data, we will need a custom function rather than simply preprocessing the entire X_train and using cross_val_score from scikit-learn.
Using the function created in the previous step, build multiple logistic regression models with different hyperparameters in order to minimize log loss.
Preprocess the full training set and test set appropriately, then evaluate the final model with various classification metrics in addition to log loss.
This process should be fairly familiar by now. In the cell below, use the variable df (that has already been created) in order to create X and y, then training and test sets using train_test_split (documentation here).
We'll use a random state of 42 and stratify=y (to ensure an even balance of tree types) in the train-test split. Recall that the target is Cover_Type.
# Replace None with appropriate code
# Import the relevant function
None
# Split df into X and y
X = None
y = None
# Perform train-test split with random_state=42 and stratify=y
X_train, X_test, y_train, y_test = None
Check that you have the correct data shape before proceeding:
# Run this cell without changes
# X and y training data should have the same number of rows
assert X_train.shape[0] == y_train.shape[0] and X_train.shape[0] == 28875
# X and y testing data should have the same number of rows
assert X_test.shape[0] == y_test.shape[0] and X_test.shape[0] == 9626
# Both X should have 52 columns
assert X_train.shape[1] == X_test.shape[1] and X_train.shape[1] == 52
# Both y should have 1 column
assert len(y_train.shape) == len(y_test.shape) and len(y_train.shape) == 1
Also, we should have roughly equal percentages of cottonwood/willow trees for train vs. test targets:
# Run this cell without changes
print("Train percent cottonwood/willow:", y_train.value_counts(normalize=True)[1])
print("Test percent cottonwood/willow: ", y_test.value_counts(normalize=True)[1])
Using scikit-learn's LogisticRegression model, instantiate a classifier with random_state=42. Then use cross_val_score with scoring="neg_log_loss" to find the average cross-validated log loss for this model on X_train and y_train.
(Similar to RMSE, the internal implementation of cross_val_score requires that we use "negative log loss" instead of just log loss. The code provided negates the result for you.)
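If the sign flip is confusing, here is a tiny standalone illustration (toy numbers, not lab data) of how "negative log loss" relates to ordinary log loss:
# Toy illustration of the sign convention (values are made up)
from sklearn.metrics import log_loss

toy_y_true = [0, 0, 1]
toy_y_prob = [0.1, 0.2, 0.8]  # predicted probabilities of class 1

toy_loss = log_loss(toy_y_true, toy_y_prob)
print(toy_loss)   # ordinary log loss (lower is better), about 0.18
print(-toy_loss)  # what a "neg_log_loss" scorer reports (higher, i.e. closer to 0, is better)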
The exercise cell below should produce a warning but not an error. Because we have not scaled the data, we expect to get a ConvergenceWarning five times (once for each fold of cross validation).
# Replace None with appropriate code
# Import relevant class and function
None
None
# Instantiate a LogisticRegression with random_state=42
baseline_model = None
# Use cross_val_score with scoring="neg_log_loss" to evaluate the model
# on X_train and y_train
baseline_neg_log_loss_cv = None
baseline_log_loss = -(baseline_neg_log_loss_cv.mean())
baseline_log_loss
OK, so we are getting the ConvergenceWarnings we expected, and a log loss of around 0.172 with our baseline model.
Is that a "good" log loss? That's hard to say — log loss is not particularly interpretable.
If we had a model that just chose 0 (the majority class) every time, this is the log loss we would get:
# Run this cell without changes
from sklearn.metrics import log_loss
import numpy as np
log_loss(y_train, np.zeros(len(y_train)))
Loss is a metric where lower is better, so our baseline model is clearly an improvement over just guessing the majority class every time.
Even though it is difficult to interpret, the 0.172 value will be a useful baseline as we continue modeling, to see if we are actually making improvements or just getting slightly better performance by chance.
We will also use other metrics at the last step in order to describe the final model's performance in a more user-friendly way.
First, consider: which preprocessing steps should be taken with this dataset? Recall that our data is imbalanced, and that the unscaled features caused a ConvergenceWarning for our baseline model.
# Replace None with appropriate text
"""
None
"""
As you likely noted above, we should use some kind of resampling technique to address the large class imbalance. Let's use SMOTE (synthetic minority oversampling, documentation here), which creates synthetic examples of the minority class to help train the model.
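As a quick, optional demonstration of what SMOTE does (illustrative only; sm_demo, X_demo, and y_demo are not variables the lab asks you to create), fit_resample returns a resampled copy of whatever data it is given. With default settings it fully balances the classes:
# Optional: see how SMOTE changes the class counts (default settings balance 1:1)
from imblearn.over_sampling import SMOTE

sm_demo = SMOTE(random_state=42)
X_demo, y_demo = sm_demo.fit_resample(X_train, y_train)

print("Before resampling:")
print(y_train.value_counts())
print("After resampling:")
print(pd.Series(y_demo).value_counts())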
Does SMOTE work just like a typical scikit-learn transformer, where you fit the transformer on the training data then transform both the training and the test data?
# Replace None with appropriate text
"""
None
"""
As you also likely noted above, we should use some transformer to normalize the data. Let's use a StandardScaler (documentation here).
Does StandardScaler work just like a typical scikit-learn transformer, where you fit the transformer on the training data then transform both the training and the test data?
# Replace None with appropriate text
"""
None
"""
(At this point it's a good idea to double-check your answers against the solution branch to make sure you understand the setup.)
As you can see from the cross_val_score documentation linked above, "under the hood" it is using StratifiedKFold for classification tasks.
Essentially StratifiedKFold is just providing the information you need to make 5 separate train-test splits inside of X_train. Then there is other logic within cross_val_score to fit and evaluate the provided model.
So, if our original code looked like this:
baseline_model = LogisticRegression(random_state=42)
baseline_neg_log_loss_cv = cross_val_score(baseline_model, X_train, y_train, scoring="neg_log_loss")
baseline_log_loss = -(baseline_neg_log_loss_cv.mean())
baseline_log_loss
The equivalent of the above code using StratifiedKFold would look something like this:
# Run this cell without changes
from sklearn.metrics import make_scorer
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
# Negative log loss doesn't exist as something we can import,
# but we can create it
neg_log_loss = make_scorer(log_loss, greater_is_better=False, needs_proba=True)
# Instantiate the model (same as previous example)
baseline_model = LogisticRegression(random_state=42)
# Create a list to hold the score from each fold
kfold_scores = np.ndarray(5)
# Instantiate a splitter object and loop over its result
kfold = StratifiedKFold()
for fold, (train_index, val_index) in enumerate(kfold.split(X_train, y_train)):
    # Extract train and validation subsets using the provided indices
    X_t, X_val = X_train.iloc[train_index], X_train.iloc[val_index]
    y_t, y_val = y_train.iloc[train_index], y_train.iloc[val_index]

    # Clone the provided model and fit it on the train subset
    temp_model = clone(baseline_model)
    temp_model.fit(X_t, y_t)

    # Evaluate the provided model on the validation subset
    neg_log_loss_score = neg_log_loss(temp_model, X_val, y_val)
    kfold_scores[fold] = neg_log_loss_score

-(kfold_scores.mean())
As you can see, this produced the same result as our original cross validation (including the ConvergenceWarnings):
# Run this cell without changes
print(baseline_neg_log_loss_cv)
print(kfold_scores)
So, what is the point of doing it this way, instead of the much-shorter cross_val_score approach?
Using StratifiedKFold "by hand" allows us to customize what happens inside of that loop. Therefore we can apply these preprocessing techniques appropriately:
- Fit a StandardScaler object on the training subset (not the full training data), then use it to transform both the train and validation subsets
- Fit a SMOTE object and resample only the training subset
A minimal sketch of this fold-wise pattern is shown below, before the exercise cell.
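Here is that optional sketch. The variable names (X_fold_train, fold_scaler, etc.) are illustrative only, not the names the exercise asks for; the point is simply that the scaler and SMOTE are fit on each fold's training subset, and that the validation subset is only ever scaled, never resampled.
# Optional illustration only -- the exercise below asks for the same idea with its own variable names
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_train, y_train):
    X_fold_train, X_fold_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
    y_fold_train = y_train.iloc[train_idx]

    # Scaler: fit on the fold's training subset only, then transform both subsets
    fold_scaler = StandardScaler()
    X_fold_train_scaled = fold_scaler.fit_transform(X_fold_train)
    X_fold_val_scaled = fold_scaler.transform(X_fold_val)

    # SMOTE: resample the (scaled) training subset only, never the validation subset
    fold_sm = SMOTE(random_state=42)
    X_fold_train_res, y_fold_train_res = fold_sm.fit_resample(X_fold_train_scaled, y_fold_train)

    print(f"fold training rows before/after SMOTE: {len(X_fold_train)} -> {len(X_fold_train_res)}")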
In the exercise cell below, we have set up a function custom_cross_val_score with an interface that resembles the cross_val_score function from scikit-learn. Most of it is set up for you already; all you need to do is add the SMOTE and StandardScaler steps described above.
# Replace None with appropriate code
# Import relevant sklearn and imblearn classes
None
None
def custom_cross_val_score(estimator, X, y):
    # Create arrays to hold the scores from each fold
    kfold_train_scores = np.ndarray(5)
    kfold_val_scores = np.ndarray(5)

    # Instantiate a splitter object and loop over its result
    kfold = StratifiedKFold(n_splits=5)
    for fold, (train_index, val_index) in enumerate(kfold.split(X, y)):
        # Extract train and validation subsets using the provided indices
        X_t, X_val = X.iloc[train_index], X.iloc[val_index]
        y_t, y_val = y.iloc[train_index], y.iloc[val_index]

        # Instantiate StandardScaler
        scaler = None
        # Fit and transform X_t
        X_t_scaled = None
        # Transform X_val
        X_val_scaled = None

        # Instantiate SMOTE with random_state=42 and sampling_strategy=0.28
        sm = None
        # Fit and transform X_t_scaled and y_t using sm
        X_t_oversampled, y_t_oversampled = None

        # Clone the provided model and fit it on the train subset
        temp_model = clone(estimator)
        temp_model.fit(X_t_oversampled, y_t_oversampled)

        # Evaluate the provided model on the train and validation subsets
        neg_log_loss_score_train = neg_log_loss(temp_model, X_t_oversampled, y_t_oversampled)
        neg_log_loss_score_val = neg_log_loss(temp_model, X_val_scaled, y_val)
        kfold_train_scores[fold] = neg_log_loss_score_train
        kfold_val_scores[fold] = neg_log_loss_score_val

    return kfold_train_scores, kfold_val_scores

model_with_preprocessing = LogisticRegression(random_state=42, class_weight={1: 0.28})
preprocessed_train_scores, preprocessed_neg_log_loss_cv = custom_cross_val_score(model_with_preprocessing, X_train, y_train)
-(preprocessed_neg_log_loss_cv.mean())
The output you get should be about 0.132, and there should no longer be a ConvergenceWarning.
If you're not getting the right output, double check that you are applying the correct transformations to the correct variables:
- X_t should be scaled to create X_t_scaled, then X_t_scaled should be resampled to create X_t_oversampled, then X_t_oversampled should be used to fit the model
- X_val should be scaled to create X_val_scaled, then X_val_scaled should be used to evaluate neg_log_loss
- y_t should be resampled to create y_t_oversampled, then y_t_oversampled should be used to fit the model
- y_val should not be transformed in any way. It should just be used to evaluate neg_log_loss
Another thing to check is that you used sampling_strategy=0.28 when you instantiated the SMOTE object.
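If you're wondering what sampling_strategy=0.28 actually does: when given a float, imblearn interprets it as the desired ratio of minority to majority samples after resampling. Here is an optional way to confirm that (illustrative only; y_resampled_check and resampled_counts are just names for this sketch):
# Optional check: sampling_strategy=0.28 means minority count ~= 0.28 * majority count
from imblearn.over_sampling import SMOTE

_, y_resampled_check = SMOTE(random_state=42, sampling_strategy=0.28).fit_resample(X_train, y_train)
resampled_counts = pd.Series(y_resampled_check).value_counts()
print(resampled_counts)
print("Minority / majority ratio:", resampled_counts[1] / resampled_counts[0])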
If you are getting the right output, great! Let's compare that to our baseline log loss:
# Run this cell without changes
print(-baseline_neg_log_loss_cv.mean())
print(-preprocessed_neg_log_loss_cv.mean())
Looks like our preprocessing with StandardScaler and SMOTE has provided some improvement over the baseline! Let's move on to Step 4.
Now that we have applied appropriate preprocessing steps to our data in our custom cross validation function, we can reuse that function to test multiple different LogisticRegression models.
For each model iteration, make sure you specify class_weight={1: 0.28}, because this aligns with the weighting created by our SMOTE process.
One of the first questions to ask when you start iterating on any model is: are we overfitting? Many of the models you will learn during this course will have built-in functionality to reduce overfitting.
To determine whether we are overfitting, let's examine the training scores vs. the validation scores from our existing modeling process:
# Run this cell without changes
print("Train: ", -preprocessed_train_scores)
print("Validation:", -preprocessed_neg_log_loss_cv)
Remember that these are loss metrics, meaning lower is better. Are we overfitting?
# Replace None with appropriate text
"""
None
"""
It's actually possible that we are underfitting due to too much regularization. Remember that LogisticRegression from scikit-learn applies regularization by default:
# Run this cell without changes
model_with_preprocessing.get_params()
That first key-value pair, 'C': 1.0, specifies the regularization strength. As noted in the scikit-learn LogisticRegression docs, C is:
Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
In general, if you are increasing C you want to increase it by orders of magnitude, i.e. not increasing it to 1.1, but rather to 1e3, 1e5, etc.
In the cell below, instantiate a LogisticRegression model with lower regularization (i.e. higher C), along with random_state=42 and class_weight={1: 0.28}. Call this model model_less_regularization.
# Replace None with appropriate code
model_less_regularization = None
This code cell double-checks that the model was created correctly:
# Run this cell without changes
# Check variable type
assert type(model_less_regularization) == LogisticRegression
# Check params
assert model_less_regularization.get_params()["random_state"] == 42
assert model_less_regularization.get_params()["class_weight"] == {1: 0.28}
assert model_less_regularization.get_params()["C"] != 1.0
Now, evaluate that model using custom_cross_val_score. Recall that custom_cross_val_score takes 3 arguments: estimator, X, and y. In this case, estimator should be model_less_regularization, X should be X_train, and y should be y_train.
# Replace None with appropriate code
less_regularization_train_scores, less_regularization_val_scores = None
print("Previous Model")
print("Train average: ", -preprocessed_train_scores.mean())
print("Validation average:", -preprocessed_neg_log_loss_cv.mean())
print("Current Model")
print("Train average: ", -less_regularization_train_scores.mean())
print("Validation average:", -less_regularization_val_scores.mean())
Your answers will vary somewhat depending on the value of C that you chose, but in general you should see a slight improvement, from something like a 0.132358 validation average to 0.132344 (an improvement of 0.000014). Not a massive difference, but it is an improvement!
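If you would like to be a bit more systematic about choosing C, one optional approach (the candidate values below are arbitrary, not values the lab requires) is to sweep a few orders of magnitude with the same custom cross-validation function and compare the validation averages:
# Optional: compare a few orders of magnitude of C (candidate values are arbitrary)
for C_value in [1e1, 1e3, 1e5]:
    candidate_model = LogisticRegression(random_state=42, class_weight={1: 0.28}, C=C_value)
    _, candidate_val_scores = custom_cross_val_score(candidate_model, X_train, y_train)
    print(f"C={C_value:.0e}  average validation log loss: {-candidate_val_scores.mean():.6f}")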
Right now we are using the default solver and type of regularization penalty:
# Run this cell without changes
print("solver:", model_less_regularization.get_params()["solver"])
print("penalty:", model_less_regularization.get_params()["penalty"])
What if we want to try a different kind of regularization penalty?
From the docs:
- ‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty
- ‘liblinear’ and ‘saga’ also handle L1 penalty
- ‘saga’ also supports ‘elasticnet’ penalty
In other words, the only solvers that support L1 or elastic net penalties are liblinear and saga. liblinear is going to be quite slow with the size of our dataset, so let's use saga.
In the cell below, create a model that uses solver="saga" and penalty="elasticnet". Then use the l1_ratio argument to specify the mix of L1 and L2 regularization. You want a value greater than zero (zero would mean using only L2 regularization) and less than one (one would mean using only L1 regularization).
Remember to also specify random_state=42, class_weight={1: 0.28}, and C equal to the value you previously used.
# Replace None with appropriate code
model_alternative_solver = None
alternative_solver_train_scores, alternative_solver_val_scores = custom_cross_val_score(
    model_alternative_solver,
    X_train,
    y_train
)
print("Previous Model (Less Regularization)")
print("Train average: ", -less_regularization_train_scores.mean())
print("Validation average:", -less_regularization_val_scores.mean())
print("Current Model")
print("Train average: ", -alternative_solver_train_scores.mean())
print("Validation average:", -alternative_solver_val_scores.mean())
Most likely you started getting ConvergenceWarnings again, even though we are scaling the data inside of custom_cross_val_score. When you get a convergence warning in a case like this, you want to modify the tol and/or max_iter parameters.
If you are getting good results (good metrics) but are still getting a ConvergenceWarning, consider increasing the tolerance (the tol argument). The tolerance specifies how close to zero the gradient must be in order to stop taking additional steps. It's possible that your model is finding a gradient that is close enough to zero, but slightly above the default tolerance, even though everything otherwise looks good.
In this case, we are getting slightly worse metrics on both the train and the validation data (compared to the previous solver), so increasing the number of iterations (max_iter) seems like a better strategy. Essentially this allows the gradient descent algorithm to take more steps as it tries to find an optimal solution.
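One optional way to confirm that the iteration cap (rather than the tolerance) is the issue is to fit a copy of the model on scaled training data and compare its n_iter_ attribute to max_iter; if they match, the solver stopped because it ran out of iterations, not because it converged. (This sketch assumes you defined model_alternative_solver above; check_model is just an illustrative name.)
# Optional diagnostic: did the saga solver hit its iteration cap?
from sklearn.preprocessing import StandardScaler

check_model = clone(model_alternative_solver)
check_model.fit(StandardScaler().fit_transform(X_train), y_train)

print("Iterations used: ", check_model.n_iter_)
print("max_iter setting:", check_model.get_params()["max_iter"])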
In the exercise cell below, create a model called model_more_iterations that has the same hyperparameters as model_alternative_solver, with the addition of an increased max_iter. You'll need to increase max_iter significantly, to a number in the thousands.
Note: as you increase max_iter, it is normal for the execution time of fitting the model to increase. The following cell may take up to several minutes to execute. Try to be patient with this exercise! If this step times out, you can just read ahead.
# Replace None with appropriate code
model_more_iterations = None
more_iterations_train_scores, more_iterations_val_scores = custom_cross_val_score(
    model_more_iterations,
    X_train,
    y_train
)
print("Previous Best Model (Less Regularization)")
print("Train average: ", -less_regularization_train_scores.mean())
print("Validation average:", -less_regularization_val_scores.mean())
print("Previous Model with This Solver")
print("Train average: ", -alternative_solver_train_scores.mean())
print("Validation average:", -alternative_solver_val_scores.mean())
print("Current Model")
print("Train average: ", -more_iterations_train_scores.mean())
print("Validation average:", -more_iterations_val_scores.mean())
The results you got are most likely around 0.13241, whereas the previous model was around 0.13234. In other words, even after waiting all that time, we are getting 0.00007 worse log loss with this solver.
This is a fairly typical experience when hyperparameter tuning! Often the default hyperparameters are the default because they work best in the most situations. This is especially true of logistic regression, which has relatively few hyperparameters. Once we get to more complex models, there are more "levers to pull" (hyperparameters to adjust) so it is more likely that we'll improve performance by deviating from the default.
Let's declare model_less_regularization to be our best model, and move on to the final evaluation phase.
# Run this cell without changes
final_model = model_less_regularization
In order to evaluate our final model, we need to preprocess the full training and test data, fit the model on the full training data, and evaluate it on the full test data. Initially we'll continue to use log loss for the evaluation step.
# Replace None with appropriate code
# Instantiate StandardScaler
scaler = None
# Fit and transform X_train
X_train_scaled = None
# Transform X_test
X_test_scaled = None
# Instantiate SMOTE with random_state=42 and sampling_strategy=0.28
sm = None
# Fit and transform X_train_scaled and y_train using sm
X_train_oversampled, y_train_oversampled = None
# Run this cell without changes
final_model.fit(X_train_oversampled, y_train_oversampled)
# Run this cell without changes
log_loss(y_test, final_model.predict_proba(X_test_scaled))
Great! We are getting slightly better performance when we train the model with the full training set, compared to the average cross-validated performance. This is typical since models tend to perform better with more training data.
This model has improved log loss compared to our initial baseline model, which had about 0.172.
But we're not quite done here!
If we wanted to present this forest cover classification model to non-technical stakeholders, log loss would be a confusing choice. Let's compute some other metrics that tell the story of our model's performance in a more interpretable way.
Although we noted the issues with accuracy as a metric on unbalanced datasets, accuracy is a very intuitive metric. Recall that we would expect an accuracy of about 0.928651 if we identified every cell as class 0. What accuracy do we get with our new model?
(Note that we used .predict_proba above, but accuracy uses .predict.)
# Replace None with appropriate code
from sklearn.metrics import accuracy_score
accuracy_score(None, None)
In other words, our model correctly identifies the type of forest cover about 94.6% of the time, whereas always guessing the majority class (ponderosa pine) would only be accurate about 92.9% of the time.
If we always chose the majority class, we would expect a precision of 0, since we would never identify any "true positives". What is the precision of our final model? Use precision_score from scikit-learn (documentation here).
# Replace None with appropriate code
# Import the relevant function
None
# Display the precision score
None
In other words, if our model labels a given cell of forest as class 1, there is about a 66.6% chance that it is actually class 1 (cottonwood/willow) and about a 33.4% chance that it is actually class 0 (ponderosa pine).
Again, if we always chose the majority class, we would expect a recall of 0. What is the recall of our final model? Use recall_score from scikit-learn (documentation here).
# Replace None with appropriate code
# Import the relevant function
None
# Display the recall score
None
In other words, if a given cell of forest is actually class 1, there is about a 47.9% chance that our model will correctly label it as class 1 (cottonwood/willow) and about a 52.1% chance that our model will incorrectly label it as class 0 (ponderosa pine).
Depending on the stakeholder, you most likely want to report just precision or just recall. Try to understand their business case:
- If false positives are a bigger problem (labeled cottonwood/willow when it's really ponderosa pine), precision is the important metric to report
- If false negatives are a bigger problem (labeled ponderosa pine when it's really cottonwood/willow), recall is the important metric to report
If those problems have truly equal importance, you could report an f1-score instead, although this is somewhat more difficult for the average person to interpret.
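For reference, here is a quick optional sketch of computing it (this assumes the X_test_scaled variable created for the final evaluation above):
# Optional: f1-score is the harmonic mean of precision and recall
from sklearn.metrics import f1_score

f1_score(y_test, final_model.predict(X_test_scaled))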
If either of those problems is important enough that it outweighs overall accuracy, you could also adjust the decision threshold of your final model to improve the metric that matters most. Let's say that it's important to improve the recall score — that we want to be able to correctly label more of the cottonwood/willow trees as cottonwood/willow trees, even if that means accidentally labeling more ponderosa pine as cottonwood/willow incorrectly.
Then we can use .predict_proba to err on the side of the positive class. Let's use an exaggerated example, which assumes that false negatives are a very significant problem. Instead of using the default 50% threshold (where a probability over 0.5 is classified as positive), let's say that if there is greater than a 1% chance it's positive, we classify it as positive:
(If the opposite issue were the case — it's very important that every area classified as 1 is actually cottonwood/willow — you would want the threshold to be higher than 50% rather than lower than 50%.)
# Run this cell without changes
def final_model_func(model, X):
    """
    Custom function to predict probability of
    cottonwood/willow. If the model says there
    is >1% chance, we label it as class 1
    """
    probs = model.predict_proba(X)[:, 1]
    return [int(prob > 0.01) for prob in probs]
# Calculate predictions with adjusted threshold and
# display proportions of predictions
threshold_adjusted_probs = pd.Series(final_model_func(final_model, X_test_scaled))
threshold_adjusted_probs.value_counts(normalize=True)
So, now we are labeling everything with over a 1% chance of being class 1 as class 1, which means that we're classifying about 58.6% of the records as class 0 and 41.4% of the records as class 1.
# Run this cell without changes
print("Accuracy:", accuracy_score(y_test, threshold_adjusted_probs))
print("Recall: ", recall_score(y_test, threshold_adjusted_probs))
This means that we are able to identify 99.1% of the true positives (i.e. 99.1% of the cottonwood/willow cells are identified). However this comes at a cost; our overall accuracy is now 65.7% instead of over 90%.
So we are classifying over 40% of the cells as cottonwood/willow, even though fewer than 10% of the cells are actually that category, in order to miss as few true positives as possible. Even though this seems fairly extreme, our model is still better than just choosing class 1 every time (that model would have about 7% accuracy).
This kind of model might be useful if there is some kind of treatment needed for cottonwood/willow trees, but your organization only has the resources to visit fewer than half of the study areas. This model would allow them to visit 41% of the areas and successfully treat over 99% of the cottonwood/willow trees.
You can also imagine a less-drastic version of this threshold adjusting, where you trade off a marginal improvement in precision or recall for a marginal reduction in accuracy. Visually inspecting the precision-recall curve (documentation here) can help you choose the threshold based on what you want to optimize.
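Here is an optional sketch of that kind of inspection, assuming matplotlib is available; it plots precision and recall across all candidate thresholds so you can see the trade-off directly:
# Optional: visualize precision and recall across decision thresholds
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(
    y_test, final_model.predict_proba(X_test_scaled)[:, 1]
)

fig, ax = plt.subplots()
ax.plot(thresholds, precisions[:-1], label="precision")
ax.plot(thresholds, recalls[:-1], label="recall")
ax.set_xlabel("decision threshold")
ax.legend();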
In this lab, you completed an end-to-end machine learning modeling process with logistic regression on an imbalanced dataset. First you built and evaluated a baseline model. Next you wrote a custom cross validation function in order to use SMOTE resampling appropriately (without needing an imblearn pipeline). After that, you tuned the model by adjusting the regularization strength and the gradient descent hyperparameters. Finally, you evaluated your final model on log loss as well as more user-friendly metrics.