bearpaw/pytorch-pose

cannot reproduce hg_s1_b1 result


I noticed that your log here starts with a larger initial learning rate of 0.001 and schedule=[150, 175, 200]; below is part of your log:

Epoch	LR	Train Loss	Val Loss	Train Acc	Val Acc	
1.000000	0.001000	0.001369	0.000828	0.070879	0.138562	
2.000000	0.001000	0.000856	0.001058	0.158208	0.200655	
3.000000	0.001000	0.000758	0.000854	0.213208	0.208725	
4.000000	0.001000	0.000699	0.000596	0.281929	0.384714	
5.000000	0.001000	0.000635	0.000575	0.337208	0.440630	
6.000000	0.001000	0.000582	0.000541	0.421062	0.487058	
7.000000	0.001000	0.000559	0.000521	0.467490	0.538204	
8.000000	0.001000	0.000536	0.000495	0.514954	0.582253	
9.000000	0.001000	0.000520	0.000483	0.549438	0.609111	
10.000000	0.001000	0.000506	0.000469	0.574788	0.634015	
11.000000	0.001000	0.000497	0.000475	0.595450	0.629678	
12.000000	0.001000	0.000488	0.000458	0.610554	0.655569	
13.000000	0.001000	0.000481	0.000464	0.621428	0.642120	
14.000000	0.001000	0.000475	0.000444	0.634942	0.674910	
15.000000	0.001000	0.000470	0.000445	0.643844	0.672073	
16.000000	0.001000	0.000465	0.000457	0.649695	0.644244	
17.000000	0.001000	0.000461	0.000434	0.657655	0.692058	
18.000000	0.001000	0.000455	0.000432	0.669486	0.699718	
19.000000	0.001000	0.000451	0.000431	0.675828	0.704502	
20.000000	0.001000	0.000450	0.000427	0.676318	0.705441	
21.000000	0.001000	0.000447	0.000423	0.685184	0.715312	
22.000000	0.001000	0.000444	0.000439	0.687975	0.685048	
23.000000	0.001000	0.000440	0.000420	0.694823	0.718964	
24.000000	0.001000	0.000439	0.000423	0.697721	0.718909	
25.000000	0.001000	0.000435	0.000417	0.704000	0.727210	
26.000000	0.001000	0.000433	0.000420	0.706374	0.718607	
27.000000	0.001000	0.000432	0.000414	0.706610	0.727208	
28.000000	0.001000	0.000429	0.000415	0.713337	0.726208	
29.000000	0.001000	0.000426	0.000414	0.718950	0.731994	
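For context, schedule=[150, 175, 200] steps the learning rate down at those epochs. A minimal sketch of that step-decay pattern, assuming a decay factor gamma=0.1 (the repo's actual helper and gamma value may differ):

```python
# Sketch of a step-decay schedule like schedule=[150, 175, 200].
# gamma=0.1 is an assumed decay factor, not confirmed against the repo.
def adjust_learning_rate(optimizer, epoch, lr, schedule, gamma=0.1):
    if epoch in schedule:
        lr *= gamma
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr
    return lr
```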

My validation loss, however, blows up drastically with the same lr and schedule as yours, with momentum=0 (the default) or 0.1 (your model's internal parameter); same columns as above:

1.000000	0.001000	0.000911	0.001155	0.144576	0.245235	
2.000000	0.001000	0.000635	6.696480	0.292642	0.002924	
3.000000	0.001000	0.000599	79.269006	0.368526	0.000000	
4.000000	0.001000	0.000577	342.079974	0.411786	0.001092	
5.000000	0.001000	0.000560	1973.556534	0.447012	0.000176	

Are there any other default parameters that should be changed?

Hi @GarrickLin, same for me. I have even tried the hg_s8_b1 architecture by running the provided hg_s8_b1.sh file (it is available in the drive files).

Was there any substantial change after your experiments, @bearpaw?

Here are the results so far:

Epoch: 2 | LR: 0.00050000
Processing |################################| (1854/1854) Data: 0.000218s | Batch: 0.484s | Total: 0:17:43 | ETA: 0:00:01 | Loss: 0.0061 | Acc:  0.0001
Processing |################################| (247/247) Data: 0.000111s | Batch: 0.384s | Total: 0:01:34 | ETA: 0:00:01 | Loss: 0.0065 | Acc:  0.0012

Epoch: 3 | LR: 0.00050000
Processing |################################| (1854/1854) Data: 0.000162s | Batch: 0.393s | Total: 0:17:18 | ETA: 0:00:01 | Loss: 0.0056 | Acc:  0.0007
Processing |################################| (247/247) Data: 0.000095s | Batch: 0.380s | Total: 0:01:33 | ETA: 0:00:01 | Loss: 0.0250 | Acc:  0.0004

Epoch: 4 | LR: 0.00050000
Processing |################################| (1854/1854) Data: 0.000209s | Batch: 0.858s | Total: 0:17:30 | ETA: 0:00:01 | Loss: 0.0049 | Acc:  0.0015
Processing |################################| (247/247) Data: 0.000110s | Batch: 0.379s | Total: 0:01:33 | ETA: 0:00:01 | Loss: 0.7430 | Acc:  0.0000

Epoch: 5 | LR: 0.00050000
Processing |################################| (1854/1854) Data: 0.000161s | Batch: 0.402s | Total: 0:17:45 | ETA: 0:00:01 | Loss: 0.0044 | Acc:  0.0033
Processing |################################| (247/247) Data: 0.000104s | Batch: 0.396s | Total: 0:01:37 | ETA: 0:00:01 | Loss: 3.5382 | Acc:  0.0000

Epoch: 6 | LR: 0.00050000
Processing |################################| (1854/1854) Data: 0.000187s | Batch: 0.442s | Total: 0:17:53 | ETA: 0:00:01 | Loss: 0.0043 | Acc:  0.0064
Processing |################################| (247/247) Data: 0.000093s | Batch: 0.393s | Total: 0:01:37 | ETA: 0:00:01 | Loss: 20.5096 | Acc:  0.0000

@mkocabas try reducing the learning rate; that works for me.

Which value did you use? 2.5e-4? Did you keep RMSProp as the optimizer?

I didn't change the default optimizer, but it might be the problem. It worked after I reduced the initial learning rate to a smaller value (I don't remember exactly which one); you can give it a try.
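In case it helps others, a minimal sketch of keeping the default RMSprop optimizer but with a reduced initial learning rate (2.5e-4 is the value discussed in this thread; the model below is just a placeholder, not the actual hourglass network):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder for the actual hourglass model
# Default optimizer (RMSprop) with a reduced initial learning rate.
optimizer = torch.optim.RMSprop(model.parameters(), lr=2.5e-4)
```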

Thanks!

Sorry for the confusion. For batch size 6, you should use lr 2.5e-4. You can use a larger lr for a larger batch size.
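One common way to pick the larger lr is linear scaling with batch size (the linear rule is my assumption; the comment above only says "larger"):

```python
# Linear-scaling heuristic: grow lr proportionally with batch size.
base_lr, base_batch = 2.5e-4, 6       # known-good pair from this thread
my_batch = 12                         # hypothetical larger batch size
lr = base_lr * my_batch / base_batch  # -> 5e-4
```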