Weird loss progression
RaphaelRoyerRivard opened this issue · 4 comments
Since I am training the model on VLOG with a very small batch size, the training is going to take forever (8 days). Because I don't want to wait that long, I'll stop the training before 30 epochs. But the losses shown in the logs seem odd to me. Can someone provide the log of a complete training run so I can compare the losses and see whether my early results are normal? Thanks
Learning Rate Train Loss Theta Loss Theta Skip Loss
0.000200 -0.002401 0.366067 0.331109
0.000200 -0.002381 0.369635 0.328924
0.000200 -0.001740 0.402181 0.374113
0.000200 -0.001929 0.378956 0.342752
An example log is below.
Note that the current code will not give you exactly the same losses, but the trend of how the loss develops should be similar.
Learning Rate Train Loss Theta Loss Theta Skip Loss
0.000200 -0.023201 0.223515 0.185768
0.000200 -0.082967 0.149054 0.120956
0.000200 -0.121153 0.138757 0.109839
0.000200 -0.141511 0.132837 0.103349
0.000200 -0.154124 0.130685 0.101065
0.000200 -0.164161 0.126941 0.097509
0.000200 -0.171910 0.124375 0.094423
0.000200 -0.177002 0.123230 0.092237
0.000200 -0.182402 0.120037 0.089529
0.000200 -0.186588 0.118543 0.086799
0.000200 -0.189803 0.116007 0.084808
0.000200 -0.192916 0.114425 0.082736
0.000200 -0.196440 0.112402 0.080228
0.000200 -0.198626 0.111003 0.079104
0.000200 -0.200321 0.109698 0.077720
0.000200 -0.201791 0.108161 0.076239
0.000200 -0.204281 0.105937 0.073543
0.000200 -0.207024 0.104847 0.071410
0.000200 -0.207578 0.102365 0.069629
0.000200 -0.209727 0.101646 0.069230
0.000200 -0.210965 0.100404 0.067125
0.000200 -0.213229 0.097842 0.064572
0.000200 -0.214765 0.096944 0.063795
0.000200 -0.215127 0.095416 0.062738
0.000200 -0.215839 0.094996 0.062121
0.000200 -0.217097 0.093684 0.060339
0.000200 -0.219261 0.092733 0.059287
0.000200 -0.219723 0.091869 0.058745
0.000200 -0.221097 0.091318 0.058428
0.000200 -0.221912 0.090675 0.058063
The only things I modified in your code are YOUR_DATASET_FOLDER (to point it at my data) and another path that was hardcoded.
I ran the following command on VLOG (resized to 256):
python train_cycle_siple.py --checkpoint pytorch_checkpoints/release_model_simple --batchSize 4 --workers 4
but the losses are very different from yours...
Learning Rate Train Loss Theta Loss Theta Skip Loss
0.000200 -0.002401 0.366067 0.331109
0.000200 -0.002381 0.369635 0.328924
0.000200 -0.001740 0.402181 0.374113
0.000200 -0.001929 0.378956 0.342752
0.000200 -0.001893 0.402664 0.362544
0.000200 -0.001851 0.384101 0.343538
0.000200 -0.001888 0.392817 0.348998
0.000200 -0.002026 0.373430 0.329414
0.000200 -0.002127 0.374545 0.322591
0.000200 -0.002059 0.373383 0.322823
0.000200 -0.002283 0.347109 0.295166
0.000200 -0.002365 0.354452 0.294233
0.000200 -0.002127 0.369732 0.314337
0.000200 -0.002101 0.369753 0.312066
0.000200 -0.002192 0.354708 0.296371
0.000200 -0.002064 0.373753 0.311506
0.000200 -0.002031 0.386576 0.323555
0.000200 -0.001990 0.379806 0.317385
0.000200 -0.001882 0.391573 0.329034
0.000200 -0.002011 0.374667 0.311523
0.000200 -0.001822 0.412275 0.347809
0.000200 -0.001636 0.460999 0.391921
0.000200 -0.001858 0.373273 0.313632
0.000200 -0.001881 0.371901 0.308502
The train loss is slightly increasing instead of decreasing like yours, and the other two losses are barely changing... Do you have an idea of what is going on?
Thank you
A very small batch size will work badly for batch norm. You will also need to adjust the learning rate according to the batch size: if you divide the batch size by 8, you should also divide the lr by 8.
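For reference, a minimal sketch of that linear scaling rule, assuming the default lr of 2e-4 shown in the logs was tuned for a batch size of 32 (8× the batch size of 4 used above); the constants and helper below are illustrative, not part of the repo:

```python
# Sketch of the linear learning-rate scaling rule described above.
# Assumption (not from the repo): the default lr of 2e-4 was tuned for a
# batch size of 32, i.e. 8x the batch size of 4 used in this issue.

REFERENCE_BATCH_SIZE = 32   # assumed batch size the default lr was tuned for
REFERENCE_LR = 2e-4         # default lr shown in the logs above

def scaled_lr(batch_size: int,
              ref_batch: int = REFERENCE_BATCH_SIZE,
              ref_lr: float = REFERENCE_LR) -> float:
    """Scale the learning rate linearly with the batch size."""
    return ref_lr * batch_size / ref_batch

print(scaled_lr(4))  # 2.5e-05
```

With batchSize 4 this gives lr = 2.5e-5, which you would pass through whatever learning-rate option the training script exposes. Note that scaling the lr does not fix the noisy batch-norm statistics from a batch of 4; a larger batch (or more GPUs) is the real fix there.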
Thank you for your fast answer, I will try that.