Object detection - best weights never saved
cypamigon opened this issue · 4 comments
Hello,
I'm trying to train an object detection model on a custom dataset, following the instructions provided in the README of the object_detection/src folder. I've modified the `user_config.yaml` file according to my needs and I'm running the training script with `python stm32ai_main.py`.
According to the instructions, the best model weights since the beginning of the training should be automatically saved in the `/experiments_outputs/"%Y_%m_%d_%H_%M_%S"/saved_models/` folder. However, the weights are never saved during training (no `best_weights.h5` in the folder). At the end of the training process, when the script tries to load the weights, an error is raised because the path doesn't exist!

I've tried modifying the `keras.callbacks.ModelCheckpoint` parameters to save the weights at the end of each epoch (even if they are not the best) and it works (`best_weights.h5` is saved in the saved_models folder).
I've replaced:

    # Add the Keras callback that saves the best model obtained so far
    callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=os.path.join(output_dir, saved_models_dir, model_file_name),
        save_best_only=True,
        save_weights_only=save_only_weights,  # save_only_weights = True
        monitor="val_loss",
        mode="min")
    callback_list.append(callback)
with:

    # Add the Keras callback that saves the best model obtained so far
    callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=os.path.join(output_dir, saved_models_dir, model_file_name),
        save_best_only=False,
        save_weights_only=save_only_weights,  # save_only_weights = True
        monitor="val_loss",
        mode="min")
    callback_list.append(callback)
However, I would like to save the best weights since the beginning of the training in order to get the most efficient model. Do you have any idea what could prevent the script from saving the `best_weights.h5` file when the `save_best_only` parameter is set to `True`?
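For context, with `save_best_only=True` the checkpoint file is only written when the monitored metric improves on the best value seen so far, so if no epoch ever counts as an improvement, the file is never created at all. A simplified pure-Python sketch of that logic (an illustration, not the actual Keras implementation):

```python
import math

class BestOnlyCheckpoint:
    """Simplified stand-in for ModelCheckpoint(save_best_only=True, mode="min")."""

    def __init__(self):
        self.best = math.inf    # for mode="min", the tracked best starts at infinity
        self.saved_epochs = []  # stands in for the best_weights.h5 writes

    def on_epoch_end(self, epoch, val_loss):
        # The file is (re)written only when val_loss improves on the best so far.
        if val_loss < self.best:
            self.best = val_loss
            self.saved_epochs.append(epoch)

ckpt = BestOnlyCheckpoint()
for epoch, loss in enumerate([0.9, 0.7, 0.8, 0.6]):
    ckpt.on_epoch_end(epoch, loss)
print(ckpt.saved_epochs)  # epochs 0, 1 and 3 improve on the best loss; epoch 2 does not
```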
I'm running the script on Windows 10, in an `st_zoo` virtual env as detailed in the repository README.

Here is my `user_config.yaml` file:
    general:
      project_name: Cup_Detection
      model_type: ssd_mobilenet_v2_fpnlite
      model_path: ../pretrained_models/ssd_mobilenet_v2_fpnlite/ST_pretrainedmodel_public_dataset/coco_2017_person/ssd_mobilenet_v2_fpnlite_035_416/ssd_mobilenet_v2_fpnlite_035_416.h5 #../pretrained_models/ssd_mobilenet_v2_fpnlite/ST_pretrainedmodel_public_dataset/coco_2017_person/ssd_mobilenet_v2_fpnlite_035_416/ssd_mobilenet_v2_fpnlite_035_416_int8.tflite
      logs_dir: logs
      saved_models_dir: saved_models
      gpu_memory_limit: 16
      global_seed: 127

    operation_mode: chain_tqe
    # choices=['training', 'evaluation', 'deployment', 'quantization', 'benchmarking',
    #          'chain_tqeb', 'chain_tqe', 'chain_eqe', 'chain_qb', 'chain_eqeb', 'chain_qd']

    dataset:
      name: custom_cup_dataset
      class_names: [ cup ]
      training_path: ../datasets/cup_images_dataset/train
      validation_path: ../datasets/cup_images_dataset/val
      test_path: ../datasets/cup_images_dataset/test
      quantization_path:
      quantization_split: 0.3

    preprocessing:
      rescaling: { scale: 1/127.5, offset: -1 }
      resizing:
        aspect_ratio: fit
        interpolation: nearest
      color_mode: rgb

    data_augmentation:
      rotation: 30
      shearing: 15
      translation: 0.1
      vertical_flip: 0.5
      horizontal_flip: 0.2
      gaussian_blur: 3.0
      linear_contrast: [ 0.75, 1.5 ]

    training:
      model:
        alpha: 0.35
        input_shape: (416, 416, 3)
        pretrained_weights: imagenet
      dropout:
      batch_size: 64
      epochs: 5000
      optimizer:
        Adam:
          learning_rate: 0.001
      callbacks:
        ReduceLROnPlateau:
          monitor: val_loss
          patience: 20
        EarlyStopping:
          monitor: val_loss
          patience: 40

    postprocessing:
      confidence_thresh: 0.6
      NMS_thresh: 0.5
      IoU_eval_thresh: 0.3
      plot_metrics: True # Plot precision versus recall curves. Default is False.
      max_detection_boxes: 10

    quantization:
      quantizer: TFlite_converter
      quantization_type: PTQ
      quantization_input_type: float
      quantization_output_type: uint8
      export_dir: quantized_models

    benchmarking:
      board: STM32H747I-DISCO

    tools:
      stm32ai:
        version: 8.1.0
        optimization: balanced
        on_cloud: True
        path_to_stm32ai: C:/Users/<XXXXX>/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/<*.*.*>/Utilities/windows/stm32ai.exe
      path_to_cubeIDE: C:/ST/STM32CubeIDE_1.10.1/STM32CubeIDE/stm32cubeide.exe

    deployment:
      c_project_path: ../../stm32ai_application_code/object_detection/
      IDE: GCC
      verbosity: 1
      hardware_setup:
        serie: STM32H7
        board: STM32H747I-DISCO

    mlflow:
      uri: ./experiments_outputs/mlruns

    hydra:
      run:
        dir: ./experiments_outputs/${now:%Y_%m_%d_%H_%M_%S}
Hello Cypamigon,
After some investigation with the provided yaml file, we couldn't replicate the issue of `best_weights.h5` not being present in `/experiments_outputs/"%Y_%m_%d_%H_%M_%S"/saved_models/`.

Since you are on Windows, maybe you forgot to change the 256-character maximum path length. To change this, you can follow the instructions in the TIP section of the main README (at the end).
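For reference, on Windows 10 long-path support is usually enabled through the `LongPathsEnabled` registry value; one common way to set it (shown as an illustration — the TIP section of the main README is the authoritative procedure):

```shell
:: Run in an elevated command prompt, then sign out and back in for it to take effect
reg add "HKLM\SYSTEM\CurrentControlSet\Control\FileSystem" /v LongPathsEnabled /t REG_DWORD /d 1 /f
```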
Thanks,
Thanks for your quick feedback. Unfortunately, I've already enabled Windows long path support. I've also tried changing the output path, but it behaves the same.
Ok, another explanation could be that the `ssd_mobilenet_v2_fpnlite_035_416.h5` model we provide, trained on person detection, kept the information about its previous training, in particular the best `val_loss`. When you try to save `best_weights.h5`, nothing is saved because the new `val_loss` of your training is higher than the best `val_loss`.

If this is true, a workaround could be: set `save_best_only=False` for just 1 epoch, then stop the training and use the `best_weights.h5` of this run (set it in the `general.model_path` section) to launch another training, this time with `save_best_only=True`.
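If that hypothesis is right, the effect can be illustrated with a small sketch (the numbers are hypothetical): when the tracked best value starts at the previous training's `val_loss` instead of infinity, no epoch of the new run ever triggers a save:

```python
import math

def saved_epochs(val_losses, initial_best=math.inf):
    """Epochs at which a save_best_only=True checkpoint would be written."""
    best, saved = initial_best, []
    for epoch, loss in enumerate(val_losses):
        if loss < best:  # the file is written only on improvement
            best = loss
            saved.append(epoch)
    return saved

new_run = [2.5, 2.1, 1.8]          # val_loss of the new fine-tuning run
print(saved_epochs(new_run))       # fresh best -> [0, 1, 2], file is written
print(saved_epochs(new_run, 1.2))  # carried-over best of 1.2 -> [], no file at all
```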
Thanks,
Hmm, okay, that looks promising. I'm currently running a training session with `save_best_only=False`. I'll try your solution once it finishes.
Thanks!