In this lab, you'll use a train-test partition as well as a validation set to get better insights about how to tune neural networks using regularization techniques. You'll start by repeating the process from the last section: importing the data and performing preprocessing including one-hot encoding. From there, you'll define and compile the model like before.
You will be able to:
- Apply early stopping criteria with a neural network
- Apply L1, L2, and dropout regularization on a neural network
- Examine the effects of training with more data on a neural network
Run the following cell to import some of the libraries and classes you'll need in this lab.
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
from sklearn.preprocessing import LabelBinarizer
from keras.preprocessing.text import Tokenizer
import warnings
warnings.filterwarnings('ignore')
The data is stored in the file 'Bank_complaints.csv'
. Load and preview the dataset.
# Load and preview the dataset
df = None
Before you begin to practice some of your new tools such as regularization and optimization, let's practice munging some data as you did in the previous section with bank complaints. Recall some techniques:
- Sampling in order to reduce training time (investigate model accuracy vs data size later on)
- Train - test split
- One-hot encoding your complaint text
- Transforming your category labels
Since you have quite a bit of data and training neural networks takes a substantial amount of time and resources, downsample in order to test your initial pipeline. Going forward, these can be interesting areas of investigation: how does your model's performance change as you increase (or decrease) the size of your dataset?
- Generate a random sample of 10,000 observations using seed 123 for consistency of results.
- Split this sample into
X
andy
# Downsample the data
df_sample = None
# Split the data into X and y
y = None
X = None
- Split the data into training and test sets
- Assign 1500 obervations to the test set and use 42 as the seed
# Split data into training and test sets
X_train, X_test, y_train, y_test = None
As mentioned in the previous lesson, it is good practice to set aside a validation set, which is then used during hyperparameter tuning. Afterwards, when you have decided upon a final model, the test set can then be used to determine an unbiased perforance of the model.
Run the cell below to further divide the training data into training and validation sets.
# Split the data into training and validation sets
X_train_final, X_val, y_train_final, y_val = train_test_split(X_train, y_train, test_size=1000, random_state=42)
As before, you need to do some preprocessing before building a neural network model.
- Keep the 2,000 most common words and use one-hot encoding to reformat the complaints into a matrix of vectors
- Transform the training, validate, and test sets
# Use one-hot encoding to reformat the complaints into a matrix of vectors
# Only keep the 2000 most common words
tokenizer = None
X_train_tokens = None
X_val_tokens = None
X_test_tokens = None
Similarly, now transform the descriptive product labels to integers labels. After transforming them to integer labels, retransform them into a matrix of binary flags, one for each of the various product labels.
Note: This is similar to your previous work with dummy variables. Each of the various product categories will be its own column, and each observation will be a row. In turn, each of these observation rows will have a 1 in the column associated with it's label, and all other entries for the row will be zero.
Transform the training, validate, and test sets.
# Transform the product labels to numerical values
lb = None
y_train_lb = None
y_val_lb = None
y_test_lb = None
Rebuild a fully connected (Dense) layer network:
- Use 2 hidden layers with 50 units in the first and 25 in the second layer, both with
'relu'
activation functions (since you are dealing with a multiclass problem, classifying the complaints into 7 classes) - Use a
'softmax'
activation function for the output layer
# Build a baseline neural network model using Keras
random.seed(123)
from keras import models
from keras import layers
baseline_model = None
Compile this model with:
- a stochastic gradient descent optimizer
'categorical_crossentropy'
as the loss function- a focus on
'accuracy'
# Compile the model
- Train the model for 150 epochs in mini-batches of 256 samples
- Include the
validation_data
argument to ensure you keep track of the validation loss
# Train the model
baseline_model_val = None
The attribute .history
(stored as a dictionary) contains four entries now: one per metric that was being monitored during training and validation. Print the keys of this dictionary for confirmation:
# Access the history attribute and store the dictionary
baseline_model_val_dict = None
# Print the keys
Evaluate this model on the training data:
results_train = None
print('----------')
print(f'Training Loss: {results_train[0]:.3} \nTraining Accuracy: {results_train[1]:.3}')
Evaluate this model on the test data:
results_test = None
print('----------')
print(f'Test Loss: {results_test[0]:.3} \nTest Accuracy: {results_test[1]:.3}')
Plot the loss versus the number of epochs. Be sure to include the training and the validation loss in the same plot.
# Loss vs number of epochs with train and validation sets
Create a second plot comparing training and validation accuracy to the number of epochs.
# Accuracy vs number of epochs with train and validation sets
Did you notice an interesting pattern here? Although the training accuracy keeps increasing when going through more epochs, and the training loss keeps decreasing, the validation accuracy and loss don't necessarily do the same. After a certain point, validation accuracy keeps swinging, which means that you're probably overfitting the model to the training data when you train for many epochs past a certain dropoff point. Let's tackle this now. You will now specify an early stopping point when training your model.
Overfitting neural networks is something you want to avoid at all costs. However, it's not possible to know in advance how many epochs you need to train your model on, and running the model multiple times with varying number of epochs maybe helpful, but is a time-consuming process.
We've defined a model with the same architecture as above. This time specify an early stopping point when training the model.
random.seed(123)
model_2 = models.Sequential()
model_2.add(layers.Dense(50, activation='relu', input_shape=(2000,)))
model_2.add(layers.Dense(25, activation='relu'))
model_2.add(layers.Dense(7, activation='softmax'))
model_2.compile(optimizer='SGD',
loss='categorical_crossentropy',
metrics=['accuracy'])
- Import
EarlyStopping
andModelCheckpoint
fromkeras.callbacks
- Define a list,
early_stopping
:- Monitor
'val_loss'
and continue training for 10 epochs before stopping - Save the best model while monitoring
'val_loss'
- Monitor
If you need help, consult documentation.
# Import EarlyStopping and ModelCheckpoint
# Define the callbacks
early_stopping = None
Train model_2
. Make sure you set the callbacks
argument to early_stopping
.
model_2_val = None
Load the best (saved) model.
# Load the best (saved) model
saved_model = None
Now, use this model to to calculate the training and test accuracy:
results_train = saved_model.evaluate(X_train_tokens, y_train_lb)
print(f'Training Loss: {results_train[0]:.3} \nTraining Accuracy: {results_train[1]:.3}')
print('----------')
results_test = saved_model.evaluate(X_test_tokens, y_test_lb)
print(f'Test Loss: {results_test[0]:.3} \nTest Accuracy: {results_test[1]:.3}')
Nicely done! Did you notice that the model didn't train for all 150 epochs? You reduced your training time.
Now, take a look at how regularization techniques can further improve your model performance.
First, take a look at L2 regularization. Keras makes L2 regularization easy. Simply add the kernel_regularizer=keras.regularizers.l2(lambda_coeff)
parameter to any model layer. The lambda_coeff
parameter determines the strength of the regularization you wish to perform.
- Use 2 hidden layers with 50 units in the first and 25 in the second layer, both with
'relu'
activation functions - Add L2 regularization to both the hidden layers with 0.005 as the
lambda_coeff
# Import regularizers
random.seed(123)
L2_model = models.Sequential()
# Add the input and first hidden layer
# Add another hidden layer
# Add an output layer
L2_model.add(layers.Dense(7, activation='softmax'))
# Compile the model
L2_model.compile(optimizer='SGD',
loss='categorical_crossentropy',
metrics=['accuracy'])
# Train the model
L2_model_val = L2_model.fit(X_train_tokens,
y_train_lb,
epochs=150,
batch_size=256,
validation_data=(X_val_tokens, y_val_lb))
Now, look at the training as well as the validation accuracy for both the L2 and the baseline models.
# L2 model details
L2_model_dict = L2_model_val.history
L2_acc_values = L2_model_dict['acc']
L2_val_acc_values = L2_model_dict['val_acc']
# Baseline model
baseline_model_acc = baseline_model_val_dict['acc']
baseline_model_val_acc = baseline_model_val_dict['val_acc']
# Plot the accuracy for these models
fig, ax = plt.subplots(figsize=(12, 8))
epochs = range(1, len(acc_values) + 1)
ax.plot(epochs, L2_acc_values, label='Training acc L2')
ax.plot(epochs, L2_val_acc_values, label='Validation acc L2')
ax.plot(epochs, baseline_model_acc, label='Training acc')
ax.plot(epochs, baseline_model_val_acc, label='Validation acc')
ax.set_title('Training & validation accuracy L2 vs regular')
ax.set_xlabel('Epochs')
ax.set_ylabel('Loss')
ax.legend();
The results of L2 regularization are quite disappointing here. Notice the discrepancy between validation and training accuracy seems to have decreased slightly, but the end result is definitely not getting better.
Now have a look at L1 regularization. Will this work better?
- Use 2 hidden layers with 50 units in the first and 25 in the second layer, both with
'relu'
activation functions - Add L1 regularization to both the hidden layers with 0.005 as the
lambda_coeff
random.seed(123)
L1_model = models.Sequential()
# Add the input and first hidden layer
# Add a hidden layer
# Add an output layer
L1_model.add(layers.Dense(7, activation='softmax'))
# Compile the model
L1_model.compile(optimizer='SGD',
loss='categorical_crossentropy',
metrics=['accuracy'])
# Train the model
L1_model_val = L1_model.fit(X_train_tokens,
y_train_lb,
epochs=150,
batch_size=256,
validation_data=(X_val_tokens, y_val_lb))
Plot the training as well as the validation accuracy for the L1 model:
fig, ax = plt.subplots(figsize=(12, 8))
L1_model_dict = L1_model_val.history
acc_values = L1_model_dict['acc']
val_acc_values = L1_model_dict['val_acc']
epochs = range(1, len(acc_values) + 1)
ax.plot(epochs, acc_values, label='Training acc L1')
ax.plot(epochs, val_acc_values, label='Validation acc L1')
ax.set_title('Training & validation accuracy with L1 regularization')
ax.set_xlabel('Epochs')
ax.set_ylabel('Loss')
ax.legend();
Notice how the training and validation accuracy don't diverge as much as before. Unfortunately, the validation accuracy isn't still that good. Next, experiment with dropout regularization to see if it offers any advantages.
It's time to try another technique: applying dropout to layers. As discussed in the earlier lesson, this involves setting a certain proportion of units in each layer to zero. In the following cell:
- Apply a dropout rate of 30% to the input layer
- Add a first hidden layer with 50 units and
'relu'
activation - Apply a dropout rate of 30% to the first hidden layer
- Add a second hidden layer with 25 units and
'relu'
activation - Apply a dropout rate of 30% to the second hidden layer
# ⏰ This cell may take about a minute to run
random.seed(123)
dropout_model = models.Sequential()
# Implement dropout to the input layer
# NOTE: This is where you define the number of units in the input layer
# Add the first hidden layer
# Implement dropout to the first hidden layer
# Add the second hidden layer
# Implement dropout to the second hidden layer
# Add the output layer
dropout_model.add(layers.Dense(7, activation='softmax'))
# Compile the model
dropout_model.compile(optimizer='SGD',
loss='categorical_crossentropy',
metrics=['accuracy'])
# Train the model
dropout_model_val = dropout_model.fit(X_train_tokens,
y_train_lb,
epochs=150,
batch_size=256,
validation_data=(X_val_tokens, y_val_lb))
results_train = model.evaluate(X_train_tokens, y_train_lb)
print(f'Training Loss: {results_train[0]:.3} \nTraining Accuracy: {results_train[1]:.3}')
print('----------')
results_test = model.evaluate(X_test_tokens, y_test_lb)
print(f'Test Loss: {results_test[0]:.3} \nTest Accuracy: {results_test[1]:.3}')
You can see here that the validation performance has improved again, and the training and test accuracy are very close!
Finally, let's examine if we can improve the model's performance just by adding more data. We've quadrapled the sample dataset from 10,000 to 40,000 observations, and all you need to do is run the code!
df_bigger_sample = df.sample(40000, random_state=123)
X = df['Consumer complaint narrative']
y = df['Product']
# Train-test split
X_train_bigger, X_test_bigger, y_train_bigger, y_test_bigger = train_test_split(X,
y,
test_size=6000,
random_state=42)
# Validation set
X_train_final_bigger, X_val_bigger, y_train_final_bigger, y_val_bigger = train_test_split(X_train_bigger,
y_train_bigger,
test_size=4000,
random_state=42)
# One-hot encoding of the complaints
tokenizer = Tokenizer(num_words=2000)
tokenizer.fit_on_texts(X_train_final_bigger)
X_train_tokens_bigger = tokenizer.texts_to_matrix(X_train_final_bigger, mode='binary')
X_val_tokens_bigger = tokenizer.texts_to_matrix(X_val_bigger, mode='binary')
X_test_tokens_bigger = tokenizer.texts_to_matrix(X_test_bigger, mode='binary')
# One-hot encoding of products
lb = LabelBinarizer()
lb.fit(y_train_final_bigger)
y_train_lb_bigger = to_categorical(lb.transform(y_train_final_bigger))[:, :, 1]
y_val_lb_bigger = to_categorical(lb.transform(y_val_bigger))[:, :, 1]
y_test_lb_bigger = to_categorical(lb.transform(y_test_bigger))[:, :, 1]
# ⏰ This cell may take several minutes to run
random.seed(123)
bigger_data_model = models.Sequential()
bigger_data_model.add(layers.Dense(50, activation='relu', input_shape=(2000,)))
bigger_data_model.add(layers.Dense(25, activation='relu'))
bigger_data_model.add(layers.Dense(7, activation='softmax'))
bigger_data_model.compile(optimizer='SGD',
loss='categorical_crossentropy',
metrics=['accuracy'])
bigger_data_model_val = bigger_data_model.fit(X_train_tokens_bigger,
y_train_lb_bigger,
epochs=150,
batch_size=256,
validation_data=(X_val_tokens_bigger, y_val_lb_bigger))
results_train = bigger_data_model.evaluate(X_train_tokens_bigger, y_train_lb_bigger)
print(f'Training Loss: {results_train[0]:.3} \nTraining Accuracy: {results_train[1]:.3}')
print('----------')
results_test = bigger_data_model.evaluate(X_val_tokens_bigger, y_val_lb_bigger)
print(f'Test Loss: {results_test[0]:.3} \nTest Accuracy: {results_test[1]:.3}')
With the same amount of epochs and no regularization technique, you were able to get both better test accuracy and loss. You can still consider early stopping, L1, L2 and dropout here. It's clear that having more data has a strong impact on model performance!
- https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Consumer_complaints.ipynb
- https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/
- https://catalog.data.gov/dataset/consumer-complaint-database
In this lesson, you built deep learning models using a validation set and used several techniques such as L2 and L1 regularization, dropout regularization, and early stopping to improve the accuracy of your models.