Checking whether a trained model is overfit or underfit
sangyo1 opened this issue · 1 comments
Prerequisites
Please answer the following question for yourself before submitting an issue.
- I checked to make sure that this feature has not been requested already.
1. The entire URL of the file you are using
2. Describe the feature you request
I am training my own model from scratch with Mobilenet_v2 SSD 320x320. I have about 1000 images for training and 100 for validation. However, when I try to check whether my model is overfitting or underfitting using tensorboard --logdir, it only shows the training loss, even though I have added the validation set as well. How can I check whether my model is overfitting or underfitting?
3. Additional context
Here is my model.config
# SSD with Mobilenet v2 FPN-lite (go/fpn-lite) feature extractor, shared box
# predictor and focal loss (a mobile version of Retinanet).
# Retinanet: see Lin et al, https://arxiv.org/abs/1708.02002
# Trained on COCO, initialized from Imagenet classification checkpoint
# Train on TPU-8
#
# Achieves 22.2 mAP on COCO17 Val
model {
ssd {
inplace_batchnorm_update: true
freeze_batchnorm: false
num_classes: 7
box_coder {
faster_rcnn_box_coder {
y_scale: 10.0
x_scale: 10.0
height_scale: 5.0
width_scale: 5.0
}
}
matcher {
argmax_matcher {
matched_threshold: 0.5
unmatched_threshold: 0.5
ignore_thresholds: false
negatives_lower_than_unmatched: true
force_match_for_each_row: true
use_matmul_gather: true
}
}
similarity_calculator {
iou_similarity {
}
}
encode_background_as_zeros: true
anchor_generator {
multiscale_anchor_generator {
min_level: 3
max_level: 7
anchor_scale: 4.0
aspect_ratios: [1.0, 2.0, 0.5]
scales_per_octave: 2
}
}
image_resizer {
fixed_shape_resizer {
height: 320
width: 320
}
}
box_predictor {
weight_shared_convolutional_box_predictor {
depth: 128
class_prediction_bias_init: -4.6
conv_hyperparams {
activation: RELU_6,
regularizer {
l2_regularizer {
weight: 0.00004
}
}
initializer {
random_normal_initializer {
stddev: 0.01
mean: 0.0
}
}
batch_norm {
scale: true,
decay: 0.997,
epsilon: 0.001,
}
}
num_layers_before_predictor: 4
share_prediction_tower: true
use_depthwise: true
kernel_size: 3
}
}
feature_extractor {
type: 'ssd_mobilenet_v2_fpn_keras'
use_depthwise: true
fpn {
min_level: 3
max_level: 7
additional_layer_depth: 128
}
min_depth: 16
depth_multiplier: 1.0
conv_hyperparams {
activation: RELU_6,
regularizer {
l2_regularizer {
weight: 0.00004
}
}
initializer {
random_normal_initializer {
stddev: 0.01
mean: 0.0
}
}
batch_norm {
scale: true,
decay: 0.997,
epsilon: 0.001,
}
}
override_base_feature_extractor_hyperparams: true
}
loss {
classification_loss {
weighted_sigmoid_focal {
alpha: 0.25
gamma: 2.0
}
}
localization_loss {
weighted_smooth_l1 {
}
}
classification_weight: 1.0
localization_weight: 1.0
}
normalize_loss_by_num_matches: true
normalize_loc_loss_by_codesize: true
post_processing {
batch_non_max_suppression {
score_threshold: 1e-8
iou_threshold: 0.6
max_detections_per_class: 100
max_total_detections: 100
}
score_converter: SIGMOID
}
}
}
train_config: {
batch_size: 16
sync_replicas: true
startup_delay_steps: 0
replicas_to_aggregate: 8
num_steps: 120000
data_augmentation_options {
random_horizontal_flip {
}
}
data_augmentation_options {
random_crop_image {
min_object_covered: 0.0
min_aspect_ratio: 0.75
max_aspect_ratio: 3.0
min_area: 0.75
max_area: 1.0
overlap_thresh: 0.0
}
}
optimizer {
momentum_optimizer: {
learning_rate: {
cosine_decay_learning_rate {
learning_rate_base: .08
total_steps: 50000
warmup_learning_rate: .026666
warmup_steps: 1000
}
}
momentum_optimizer_value: 0.9
}
use_moving_average: false
}
max_number_of_boxes: 100
unpad_groundtruth_tensors: false
}
train_input_reader: {
label_map_path: "/home/ubuntu/ssl/workspace/dataset/support_post.pbtxt"
tf_record_input_reader {
input_path: "/home/ubuntu/ssl/workspace/dataset/train_posts_apr19.tfrecord"
}
}
eval_config: {
metrics_set: "coco_detection_metrics"
use_moving_averages: false
}
eval_input_reader: {
label_map_path: "/home/ubuntu/ssl/workspace/dataset/support_post.pbtxt"
shuffle: false
num_epochs: 1
tf_record_input_reader {
input_path: "/home/ubuntu/ssl/workspace/dataset/val_posts_apr19.tfrecord"
}
}
4. Are you willing to contribute it? (Yes or No)
First off, your learning rate decayed to zero before the model finished training, so the model did not change after 50K steps. Your pipeline.config sets total_steps: 50000 in the learning rate schedule, while num_steps is 120000:
```
cosine_decay_learning_rate {
learning_rate_base: .08
total_steps: 50000 # This should equal the total number of training steps (num_steps, 120000 in your config)
warmup_learning_rate: .026666
warmup_steps: 1000
}
```
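To make the effect of the mismatch concrete, here is a minimal Python sketch of a warmup-plus-cosine-decay schedule using the values from the config above. It only approximates the shape of the schedule (linear warmup, cosine decay, zero once `total_steps` is passed) and is not the Object Detection API's own implementation. The function name and the sampled steps are illustrative assumptions.

```python
import math


def cosine_decay_with_warmup(step, learning_rate_base=0.08, total_steps=50000,
                             warmup_learning_rate=0.026666, warmup_steps=1000):
    """Approximate learning rate at `step` (illustrative, not the API's code)."""
    # Assumption: past total_steps the schedule stays at zero.
    if step > total_steps:
        return 0.0
    # Linear warmup from warmup_learning_rate up to learning_rate_base.
    if step < warmup_steps:
        slope = (learning_rate_base - warmup_learning_rate) / warmup_steps
        return warmup_learning_rate + slope * step
    # Cosine decay from learning_rate_base down to zero at total_steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * learning_rate_base * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Compare the posted setting with total_steps matched to num_steps.
    for step in (0, 1000, 25000, 50000, 80000, 120000):
        posted = cosine_decay_with_warmup(step, total_steps=50000)
        matched = cosine_decay_with_warmup(step, total_steps=120000)
        print(f"step {step:>6}: posted={posted:.6f} matched={matched:.6f}")
```

With `total_steps=50000`, the sketch returns a zero learning rate for every step from 50000 to 120000. With `total_steps=120000`, the rate is still positive at step 80000.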
The best way to detect overfitting is to test your model on images it was not trained on. If the model shows low training loss but detects few or none of the objects in images it has never seen, it is overfitted.
In the case of underfitting, your loss will stay much larger during training and will not decrease. Using cosine decay should significantly reduce the chance of underfitting, but selecting an appropriate learning_rate_base value is still important.
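As a rough illustration of these two symptoms, the Python sketch below labels a run from paired training and validation loss values. The function name, the thresholds (`gap_ratio`, `plateau_tol`), and the loss values in the usage lines are all hypothetical. In practice the numbers would come from your own training and evaluation logs, and the thresholds would need tuning for your data.

```python
def diagnose_fit(train_losses, val_losses, gap_ratio=1.5, plateau_tol=0.05):
    """Heuristic label from paired train/validation losses (illustrative only).

    gap_ratio and plateau_tol are arbitrary example thresholds.
    """
    if len(train_losses) < 2 or len(train_losses) != len(val_losses):
        raise ValueError("need two or more paired loss values")
    first_train, last_train = train_losses[0], train_losses[-1]
    # Underfitting symptom: training loss barely moved from where it started.
    if last_train > (1.0 - plateau_tol) * first_train:
        return "underfit"
    # Overfitting symptom: validation loss sits well above training loss
    # and has risen from its best (lowest) value.
    if val_losses[-1] > gap_ratio * last_train and val_losses[-1] > min(val_losses):
        return "overfit"
    return "ok"


if __name__ == "__main__":
    # Hypothetical loss histories, one value per evaluation point.
    print(diagnose_fit([2.0, 1.0, 0.3, 0.1], [2.1, 1.2, 0.9, 1.4]))
    print(diagnose_fit([2.0, 1.98, 1.97], [2.1, 2.0, 2.0]))
    print(diagnose_fit([2.0, 1.0, 0.5], [2.1, 1.1, 0.6]))
```

A check like this only helps if validation loss is actually recorded, which requires running evaluation alongside training as described in the guide linked in this reply.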
Also see here for details on how to evaluate a model: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_training_and_evaluation.md