zzh8829/yolov3-tf2

model.fit() and eager_tf generate different training results

silvaurus opened this issue · 2 comments

Hello!

I didn't change the code, and I used both model.fit() and eager_tf to train the network.

With model.fit(), the average validation loss is already < 50 in the first epoch, and the training loss also drops below 50 at the beginning of the second epoch.

With eager_tf, the validation loss is still ~200 after 10 epochs, while the training loss decreases much more slowly and only reaches ~50 in the 10th epoch, which looks like overfitting.

This is the training result for model.fit():
Epoch 1:
1/358

  • loss: 9787.6289 - yolo_output_0_loss: 508.0005 - yolo_output_1_loss: 1342.9556 - yolo_output_2_loss: 7925.9561

...

357/358

  • loss: 378.2877 - yolo_output_0_loss: 22.6362 - yolo_output_1_loss: 49.9713 - yolo_output_2_loss: 294.6154

358/358

  • loss: 378.0025 - yolo_output_0_loss: 22.6236 - yolo_output_1_loss: 49.9357 - yolo_output_2_loss: 294.3785

val_loss: 51.9096 - val_yolo_output_0_loss: 8.8620 - val_yolo_output_1_loss: 7.8781 - val_yolo_output_2_loss: 24.0912

Epoch 2:
1/358

  • loss: 43.6244 - yolo_output_0_loss: 6.2404 - yolo_output_1_loss: 8.0534 - yolo_output_2_loss: 18.2523

Notice the sudden jump of the training loss from 378 to 43: model.fit() displays a running average over all the batches seen so far in the current epoch, and that average resets at each epoch boundary.
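For reference, this running-average behavior can be reproduced with tf.keras.metrics.Mean. A minimal sketch (the loss values below are made up for illustration, not taken from the log above):

```python
import tensorflow as tf

# Minimal sketch: the Keras progress bar keeps a running mean of the
# per-batch losses within the current epoch and resets it at each epoch
# boundary, which explains the 378 -> 43 jump between epochs.
loss_avg = tf.keras.metrics.Mean()

for epoch in range(2):
    loss_avg.reset_states()  # the progress bar's average resets here
    for batch_loss in [9787.6, 500.0, 43.6]:  # illustrative values only
        loss_avg.update_state(batch_loss)
        print(f"epoch {epoch + 1}: running avg = {float(loss_avg.result()):.1f}")
```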

This is the training result for eager_tf:
1_train_0, 155262.8125, [5675.242, 34116.484, 115460.375]
...

1_train_356, 523.5953369140625, [124.26721, 100.35405, 287.8407]
1_train_357, 125.0768814086914, [25.127472, 11.3394575, 77.47637]
1_val_0, 565.5044555664062, [86.86941, 158.40671, 309.0946]
...
1_val_363, 694.1661987304688, [114.45209, 213.89682, 354.6836]

(Average) 1, train: 5050.33447265625, val: 590.8134155273438

2_train_0, 788.0953369140625, [132.88559, 241.86014, 402.21585]
2_train_1, 493.3677978515625, [86.920746, 157.22601, 238.08711]

Notice that here the losses are per-iteration losses and are not averaged.
From the very first iteration, the loss values are much larger than with model.fit(), and at the end of epoch 1 the loss is still > 100, which is much worse than the < 50 seen with model.fit().

I strictly followed the training tutorial and used the datasets / Darknet weights downloaded directly from the links provided.

I suspect this relates to how the two paths process the losses differently.
Do you by any chance know why?

My current guess is that in eager_tf mode the total loss (minus the regularization loss) is not divided by the batch size.
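If that's the case, the following sketch shows the scaling mismatch (names are loosely modeled on train.py; this is an illustration, not the repo's actual code):

```python
import tensorflow as tf

# Sketch of the scaling mismatch, assuming each loss_fn returns one loss
# value per image. Summing over the batch (eager_tf style) yields numbers
# roughly batch_size times larger than the batch mean that model.fit()
# reports with the default Keras loss reduction.
batch_size = 8
per_sample_loss = tf.random.uniform([batch_size], minval=10, maxval=100)

total_loss = tf.reduce_sum(per_sample_loss)  # eager_tf-style sum over the batch
avg_loss = total_loss / batch_size           # model.fit()-comparable batch mean

print(float(total_loss), float(avg_loss))    # differ by a factor of ~batch_size
```

Dividing the eager total (after subtracting the regularization term) by the batch size should make the two logs directly comparable.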

You have to ensure these two methods print the same thing. First, it seems that your loss is not averaged. Second, the data batches are different: model.fit() may use randomly shuffled batches, while the other path iterates over the batches in a fixed order, so some difference is reasonable... : )
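On the batching point: if the tf.data pipeline shuffles the training set, pinning the shuffle (or disabling it) makes the two runs consume the same batch order. A minimal sketch on a toy dataset (the buffer size and seed here are placeholders, not the repo's values):

```python
import tensorflow as tf

# Toy stand-in for the training pipeline: a seeded shuffle with
# reshuffle_each_iteration=False yields the same batch order on every
# pass, so model.fit() and the eager loop can be compared batch-for-batch.
ds = tf.data.Dataset.range(10)
ds = ds.shuffle(buffer_size=10, seed=42, reshuffle_each_iteration=False)
ds = ds.batch(2)

for epoch in range(2):
    print([b.tolist() for b in ds.as_numpy_iterator()])  # identical both epochs
```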