xiaolonw/TimeCycle

Weird loss progression

RaphaelRoyerRivard opened this issue · 4 comments

Since I am training the model on VLOG with a very small batch size, the training is going to take forever (8 days), and because I don't want to wait that long, I will stop it before 30 epochs. But the losses shown in the logs seem odd to me. Can someone provide me with the log of a complete training so I can compare the losses and see whether my early results are normal? Thanks

Learning Rate	Train Loss	Theta Loss	Theta Skip Loss	
0.000200	-0.002401	0.366067	0.331109	
0.000200	-0.002381	0.369635	0.328924	
0.000200	-0.001740	0.402181	0.374113	
0.000200	-0.001929	0.378956	0.342752

One example log is provided below.

Note that the current code will not give you exactly the same loss values, but the trend of how the loss develops will be similar.

Learning Rate Train Loss Theta Loss Theta Skip Loss
0.000200 -0.023201 0.223515 0.185768
0.000200 -0.082967 0.149054 0.120956
0.000200 -0.121153 0.138757 0.109839
0.000200 -0.141511 0.132837 0.103349
0.000200 -0.154124 0.130685 0.101065
0.000200 -0.164161 0.126941 0.097509
0.000200 -0.171910 0.124375 0.094423
0.000200 -0.177002 0.123230 0.092237
0.000200 -0.182402 0.120037 0.089529
0.000200 -0.186588 0.118543 0.086799
0.000200 -0.189803 0.116007 0.084808
0.000200 -0.192916 0.114425 0.082736
0.000200 -0.196440 0.112402 0.080228
0.000200 -0.198626 0.111003 0.079104
0.000200 -0.200321 0.109698 0.077720
0.000200 -0.201791 0.108161 0.076239
0.000200 -0.204281 0.105937 0.073543
0.000200 -0.207024 0.104847 0.071410
0.000200 -0.207578 0.102365 0.069629
0.000200 -0.209727 0.101646 0.069230
0.000200 -0.210965 0.100404 0.067125
0.000200 -0.213229 0.097842 0.064572
0.000200 -0.214765 0.096944 0.063795
0.000200 -0.215127 0.095416 0.062738
0.000200 -0.215839 0.094996 0.062121
0.000200 -0.217097 0.093684 0.060339
0.000200 -0.219261 0.092733 0.059287
0.000200 -0.219723 0.091869 0.058745
0.000200 -0.221097 0.091318 0.058428
0.000200 -0.221912 0.090675 0.058063

The only things I modified in your code are YOUR_DATASET_FOLDER, which I set to my own path, and a few other hardcoded paths.
I ran the following command on VLOG (resized to 256):
python train_cycle_simple.py --checkpoint pytorch_checkpoints/release_model_simple --batchSize 4 --workers 4
but the losses are very different from yours...

Learning Rate	Train Loss	Theta Loss	Theta Skip Loss	
0.000200	-0.002401	0.366067	0.331109	
0.000200	-0.002381	0.369635	0.328924	
0.000200	-0.001740	0.402181	0.374113	
0.000200	-0.001929	0.378956	0.342752	
0.000200	-0.001893	0.402664	0.362544	
0.000200	-0.001851	0.384101	0.343538	
0.000200	-0.001888	0.392817	0.348998	
0.000200	-0.002026	0.373430	0.329414	
0.000200	-0.002127	0.374545	0.322591	
0.000200	-0.002059	0.373383	0.322823	
0.000200	-0.002283	0.347109	0.295166	
0.000200	-0.002365	0.354452	0.294233	
0.000200	-0.002127	0.369732	0.314337	
0.000200	-0.002101	0.369753	0.312066	
0.000200	-0.002192	0.354708	0.296371	
0.000200	-0.002064	0.373753	0.311506	
0.000200	-0.002031	0.386576	0.323555	
0.000200	-0.001990	0.379806	0.317385	
0.000200	-0.001882	0.391573	0.329034	
0.000200	-0.002011	0.374667	0.311523	
0.000200	-0.001822	0.412275	0.347809	
0.000200	-0.001636	0.460999	0.391921	
0.000200	-0.001858	0.373273	0.313632	
0.000200	-0.001881	0.371901	0.308502	

The train loss is slightly increasing instead of decreasing like yours, and the two other losses are not really changing... Do you have any idea what is going on?
Thank you

A very small batch size will work badly for batch norm. You will also need to adjust the learning rate according to the batch size: if you divide the batch size by 8, you should also divide the learning rate by 8.
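
For concreteness, here is a minimal sketch of that linear scaling rule. It assumes a reference batch size of 32 (implied by "divide by 8" relative to the --batchSize 4 used above) and the 0.000200 base learning rate shown in the logs; the scaled_lr helper is purely illustrative and not part of the repo.

```python
# Minimal sketch of the linear learning-rate scaling rule described above.
# Assumption: the reference batch size is 32 (implied by "divide by 8"
# relative to --batchSize 4); 2e-4 is the learning rate shown in the logs.

def scaled_lr(batch_size, reference_batch_size=32, base_lr=2e-4):
    """Scale the learning rate linearly with the batch size."""
    return base_lr * batch_size / reference_batch_size

print(scaled_lr(4))  # 2.5e-05, i.e. the base lr divided by 8
```

If the training script exposes a learning-rate flag (check its argparse options), the scaled value can be passed on the command line; otherwise the default learning rate can be edited in the script.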

Thank you for your fast answer, I will try that.