Inconsistent Performance and Loss when Resuming Training
Closed this issue · 4 comments
Thank you for your excellent work.
We have observed that whenever we finish a training run and then resume it with a different number of epochs, the loaded checkpoint performs significantly worse than it did at the corresponding epoch of the original run. For instance, when we resume from a model trained for 100 epochs, its performance is only comparable to that of a model trained for about 30 epochs.
This inconsistency after resuming makes it difficult for us to continue training from a checkpoint and obtain the expected results.
config.yaml
```yaml
aug_prob: 0.2
augmentations:
  args:
    aug_prob: <aug_prob>
    noise_file: <noise>
    reverb_file: <reverb>
  obj: speakerlab.process.processor.SpkVeriAug
batch_size: 256
checkpointer:
  args:
    checkpoints_dir: <exp_dir>/models
    recoverables:
      classifier: <classifier>
      embedding_model: <embedding_model>
      epoch_counter: <epoch_counter>
  obj: speakerlab.utils.checkpoint.Checkpointer
classifier:
  args:
    input_dim: <embedding_size>
    out_neurons: <num_classes>
  obj: speakerlab.models.campplus.classifier.CosineClassifier
data: data/vox2_dev/train.csv
dataloader:
  args:
    batch_size: <batch_size>
    dataset: <dataset>
    drop_last: true
    num_workers: <num_workers>
    pin_memory: true
  obj: torch.utils.data.DataLoader
dataset:
  args:
    data_file: <data>
    preprocessor: <preprocessor>
  obj: speakerlab.dataset.dataset.WavSVDataset
embedding_model:
  args:
    embed_dim: <embedding_size>
    feat_dim: <fbank_dim>
    num_blocks:
    - 3
    - 3
    - 9
    - 3
    pooling_func: GSP
  obj: speakerlab.models.dfresnet.resnet.DFResNet
embedding_size: 512
epoch_counter:
  args:
    limit: <num_epoch>
  obj: speakerlab.utils.epoch.EpochCounter
exp_dir: exp/dfresnet56
fbank_dim: 80
feature_extractor:
  args:
    mean_nor: true
    n_mels: <fbank_dim>
    sample_rate: <sample_rate>
  obj: speakerlab.process.processor.FBank
label_encoder:
  args:
    data_file: <data>
  obj: speakerlab.process.processor.SpkLabelEncoder
log_batch_freq: 100
loss:
  args:
    easy_margin: false
    margin: 0.2
    scale: 32.0
  obj: speakerlab.loss.margin_loss.ArcMarginLoss
lr: 0.1
lr_scheduler:
  args:
    fix_epoch: <num_epoch>
    max_lr: <lr>
    min_lr: <min_lr>
    optimizer: <optimizer>
    step_per_epoch: null
    warmup_epoch: 5
  obj: speakerlab.process.scheduler.WarmupCosineScheduler
margin_scheduler:
  args:
    criterion: <loss>
    final_margin: 0.2
    fix_epoch: 25
    increase_start_epoch: 15
    initial_margin: 0.0
    step_per_epoch: null
  obj: speakerlab.process.scheduler.MarginScheduler
min_lr: 0.0001
noise: data/musan/wav.scp
num_classes: 5994
num_epoch: 200
num_workers: 16
optimizer:
  args:
    lr: <lr>
    momentum: 0.9
    nesterov: true
    params: null
    weight_decay: 0.0001
  obj: torch.optim.SGD
preprocessor:
  augmentations: <augmentations>
  feature_extractor: <feature_extractor>
  label_encoder: <label_encoder>
  wav_reader: <wav_reader>
reverb: data/rirs/wav.scp
sample_rate: 16000
save_epoch_freq: 2
speed_pertub: true
wav_len: 3.0
wav_reader:
  args:
    duration: <wav_len>
    sample_rate: <sample_rate>
    speed_pertub: <speed_pertub>
  obj: speakerlab.process.processor.WavReader
```
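As an aside on how to read this config: the <...> placeholders refer back to other top-level keys, and entries with obj/args are instantiated from the dotted class path. The small resolver below is only an illustrative sketch of that convention; the names build and _resolve are made up here, and speakerlab's actual config builder may behave differently.

```python
import importlib

def build(key, config):
    """Construct the object named by a top-level config key (illustrative sketch)."""
    node = config[key]
    if isinstance(node, dict) and "obj" in node:
        # "obj" is a dotted class path; "args" are its constructor arguments.
        module_path, cls_name = node["obj"].rsplit(".", 1)
        cls = getattr(importlib.import_module(module_path), cls_name)
        kwargs = {k: _resolve(v, config) for k, v in node.get("args", {}).items()}
        # Args left as null (e.g. the optimizer's "params") are expected to be
        # filled in by the training script rather than by the config itself.
        return cls(**kwargs)
    return _resolve(node, config)

def _resolve(value, config):
    # A "<key>" string refers back to another top-level config entry.
    # (For brevity this sketch ignores embedded references such as "<exp_dir>/models".)
    if isinstance(value, str) and value.startswith("<") and value.endswith(">"):
        return build(value[1:-1], config)
    if isinstance(value, dict):
        return {k: _resolve(v, config) for k, v in value.items()}
    if isinstance(value, list):
        return [_resolve(v, config) for v in value]
    return value

# e.g. build("lr_scheduler", config) would first resolve "optimizer", "lr",
# "min_lr" and "num_epoch", then construct WarmupCosineScheduler with them.
```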
@nuaazs The lr_scheduler is a warmup cosine scheduler, and its cycle length is tied to num_epoch (fix_epoch: <num_epoch>). If you only raise num_epoch to 200 and then resume, the cosine decay is stretched over 200 epochs, so at epoch 100 the schedule is only about halfway through and the learning rate jumps back up well above the value it had at the end of the original 100-epoch run. To avoid this, adjust the lr_scheduler configuration so that the resumed run keeps a low learning rate. Alternatively, you may simply need to keep training for more epochs and let the longer schedule decay again to reach optimal performance.
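For intuition, here is a minimal sketch of a warmup-plus-cosine schedule using the values from the config above (max_lr 0.1, min_lr 0.0001, warmup_epoch 5). The helper warmup_cosine_lr is illustrative and not guaranteed to match speakerlab.process.scheduler.WarmupCosineScheduler exactly, but the shape of the curve is the same:

```python
import math

def warmup_cosine_lr(epoch, max_lr=0.1, min_lr=1e-4, warmup_epoch=5, fix_epoch=100):
    """Illustrative warmup + cosine-decay learning rate at a given epoch.

    Sketch only; not necessarily identical to
    speakerlab.process.scheduler.WarmupCosineScheduler.
    """
    if epoch < warmup_epoch:
        # Linear warmup from min_lr up to max_lr.
        return min_lr + (max_lr - min_lr) * epoch / warmup_epoch
    if epoch >= fix_epoch:
        # After the cosine cycle finishes, the lr stays at min_lr.
        return min_lr
    # Cosine decay from max_lr down to min_lr between warmup_epoch and fix_epoch.
    progress = (epoch - warmup_epoch) / (fix_epoch - warmup_epoch)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# End of the original 100-epoch schedule: the lr has decayed to min_lr.
print(warmup_cosine_lr(100, fix_epoch=100))  # ~0.0001
# Same epoch after raising num_epoch (and thus fix_epoch) to 200: the cosine
# cycle is only about halfway done, so training resumes with a learning rate
# roughly 500x larger.
print(warmup_cosine_lr(100, fix_epoch=200))  # ~0.052
```

A jump of that size is enough to undo much of the convergence from the first 100 epochs, which matches the observation that the resumed model behaves like an early-epoch checkpoint.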
Thank you for your response, @wanghuii1.
The poor results were indeed caused by the learning rate being too high after resuming.
Everything works fine now that I have adjusted the learning rate.