Can't reproduce the paper's results.
CHELSEA234 opened this issue · 7 comments
Hi, I like your work and am trying to build my own research on top of your idea, but I simply couldn't reproduce your paper's results.
Here is what I have done:
python3 src/translate.py --action walking --seq_length_out 25
python3 src/translate.py --residual_velocities --action walking
What I got:
Aside from plausible animations for each action, the following tables summarize my experiments.
'Long term' is the sampling-based loss (SA) result from my experiment, 'YOUR WORK' is your reported result, and the last row is the SRNN paper's motion-forecasting error.
| Walking \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| Long term | 1.004 | 1.190 | 1.473 | 1.594 | 1.794 | 2.027 |
| YOUR WORK | 0.92 | 0.98 | 1.02 | 1.20 | --- | --- |
| SRNN paper | 1.08 | 1.34 | 1.60 | --- | 1.90 | 2.13 |

| Eating \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| Long term | 1.195 | 1.473 | 1.998 | 2.184 | 2.316 | 2.336 |
| YOUR WORK | 0.98 | 0.99 | 1.18 | 1.31 | --- | --- |
| SRNN paper | 1.35 | 1.71 | 2.12 | --- | 2.28 | 2.58 |

| Smoking \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| Long term | 1.282 | 1.572 | 2.486 | 2.609 | 3.258 | 2.861 |
| YOUR WORK | 1.38 | 1.39 | 1.56 | 1.65 | --- | --- |
| SRNN paper | 1.90 | 2.30 | 2.90 | --- | 3.21 | 3.23 |

| Discussion \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| Long term | 1.605 | 1.986 | 2.513 | 2.702 | 3.087 | 3.187 |
| YOUR WORK | 1.78 | 1.80 | 1.83 | 1.90 | --- | --- |
| SRNN paper | 1.67 | 2.03 | 2.20 | --- | 2.39 | 2.43 |
I am puzzled about two things:

- According to your results, I suppose my numbers are off, though they look tolerable compared with the SRNN paper's. Can you give me some advice on correcting my setup?
- I suspect that ~1e5 iterations is too many, because I noticed that the error grows as the iteration count increases.

Looking forward to your reply. Many thanks!
=====================================
UPDATE
(boldface marks the better of my two checkpoints in each column)
| Walking \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| 1e4 iterations | 1.306 | 1.360 | 1.362 | 1.380 | **1.381** | **1.488** |
| 2e4 iterations | **1.195** | **1.276** | **1.318** | **1.345** | 1.401 | 1.554 |
| YOUR WORK | 0.92 | 0.98 | 1.02 | 1.20 | --- | --- |

| Eating \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| 1e4 iterations | 1.126 | 1.189 | **1.300** | **1.380** | **1.507** | **1.752** |
| 2e4 iterations | **1.043** | **1.162** | 1.379 | 1.497 | 1.674 | 2.036 |
| YOUR WORK | 0.98 | 0.99 | 1.18 | 1.31 | --- | --- |

| Smoking \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| 1e4 iterations | 1.514 | 1.597 | 1.752 | 1.789 | 1.862 | 2.257 |
| 2e4 iterations | **1.238** | **1.357** | **1.593** | **1.640** | **1.738** | **2.196** |
| YOUR WORK | 1.38 | 1.39 | 1.56 | 1.65 | --- | --- |

| Discussion \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| 1e4 iterations | 1.682 | 1.803 | 1.847 | 1.825 | 1.952 | **2.185** |
| 2e4 iterations | **1.439** | **1.603** | **1.710** | **1.728** | **1.938** | 2.196 |
| YOUR WORK | 1.78 | 1.80 | 1.83 | 1.90 | --- | --- |
Taking your advice, I checked the results at the 10000th and 20000th iterations, and they improved. Thanks!! I suppose the 20000th iteration is the better choice for the sampling-based-loss experiment, but gaps remain, especially for the walking and eating actions. Is this normal?
Really sorry to bother you: is the code on GitHub your final version, or just a demo? If the latter, what changes should I make? BTW, is the iteration count uniform across all experiments? For example, do the Seq2seq architecture, sampling-based loss (SA), and residual architecture (Residual (SA)) all use 10000 iterations?
Hi! Thanks for reporting this.
Sorry, what are the `Long term (one-hot)` and `Short term (one-hot)` results from your post?
Indeed, 1e5 iterations seems a bit too many. I'd suggest using 1e4, as in the demos; see the example below. Please let me know if that doesn't work and I can have a closer look tomorrow.
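For example, something like this (assuming the `--iterations` flag in src/translate.py; double-check the flag name there):
python3 src/translate.py --residual_velocities --action walking --iterations 10000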
Also, what result in our paper are you referring to? We reported multiple models and baselines.
Finally, re: 4 hours: what hardware are you using? I just ran your second command and it took <5 minutes for 1e4 iterations on a machine with a Titan Xp.
Hi, thanks for your patience. I've trimmed some unimportant details and rephrased my question above.
I used a small iteration count and would like your advice on how to get better results.
Thanks @una-dinosauria
Hi @CHELSEA234.
- Please don't edit your posts above; it makes it hard for future readers to follow the conversation. I'd appreciate it if you could put additional questions and information in new messages.
- It makes sense that the results don't perfectly match those of the paper -- there is random initialization, and every optimization run is different. We reported averages over multiple (5, if I remember correctly) experiments.
- Since single-action (SA) experiments have small yet variable amounts of data, it is hard to find a number of iterations that works for all actions, as some models will overfit faster than others. A common practice is to train for a large number of iterations and simply keep the model that performed best on a validation set. You can keep track of validation results on TensorBoard and simply pick the best number you see there; a minimal sketch of the idea follows below.
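Roughly, the loop looks like this (`train_step` and `validation_error` are hypothetical stand-ins for the corresponding code in src/translate.py, not its actual API):

```python
import numpy as np

def train_step():
    """Hypothetical stand-in for one optimization step in src/translate.py."""

def validation_error():
    """Hypothetical stand-in: mean Euler-angle error on the validation set."""
    return np.random.rand()  # placeholder so the sketch runs

best_error, best_step = np.inf, 0
for step in range(1, 20001):
    train_step()
    if step % 1000 == 0:      # validate periodically
        err = validation_error()
        if err < best_error:  # keep only the best-so-far checkpoint
            best_error, best_step = err, step
            # this is where you would save the model, e.g. with tf.train.Saver
print("best validation error %.3f at iteration %d" % (best_error, best_step))
```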
Cheers,
Thanks for your patient and detailed reply!

- I will pay attention to this from now on and won't edit my comments again.
- Following your suggestions (points 2 and 3 in your last comment), I averaged over 3 experiments, recording each experiment's best result within the first 20000 iterations (a rough sketch of the averaging is below the tables). In the "Residual (SA)" experiment my results are close to yours, but they still don't perfectly match on the sampling-based loss.

I think this is acceptable to some extent; do you agree? BTW, what did you mean by 'every optimization is different'?
Sampling-based loss (SA):
| Walking \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| My average | 1.180 | 1.245 | 1.29 | 1.314 | 1.373 | 1.501 |
| YOUR WORK | 0.92 | 0.98 | 1.02 | 1.20 | --- | --- |

| Eating \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| My average | 1.01 | 1.111 | 1.312 | 1.431 | 1.613 | 1.978 |
| YOUR WORK | 0.98 | 0.99 | 1.18 | 1.31 | --- | --- |

| Smoking \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| My average | 1.23 | 1.35 | 1.60 | 1.65 | 1.74 | 2.19 |
| YOUR WORK | 1.38 | 1.39 | 1.56 | 1.65 | --- | --- |

| Discussion \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| My average | 1.432 | 1.60 | 1.69 | 1.70 | 1.93 | 2.19 |
| YOUR WORK | 1.78 | 1.80 | 1.83 | 1.90 | --- | --- |
Residual (SA):
| Walking \ time (ms) | 80 | 160 | 320 | 400 |
| --- | --- | --- | --- | --- |
| My result | 0.365 | 0.619 | 0.886 | 0.991 |
| YOUR WORK | 0.34 | 0.6 | 0.95 | 1.09 |

| Eating \ time (ms) | 80 | 160 | 320 | 400 |
| --- | --- | --- | --- | --- |
| My result | 0.292 | 0.523 | 0.919 | 1.116 |
| YOUR WORK | 0.3 | 0.53 | 0.92 | 1.13 |

| Smoking \ time (ms) | 80 | 160 | 320 | 400 |
| --- | --- | --- | --- | --- |
| My result | 0.36 | 0.666 | 1.219 | 1.321 |
| YOUR WORK | 0.36 | 0.66 | 1.17 | 1.27 |

| Discussion \ time (ms) | 80 | 160 | 320 | 400 |
| --- | --- | --- | --- | --- |
| My result | 0.418 | 0.886 | 1.336 | 1.439 |
| YOUR WORK | 0.44 | 0.93 | 1.45 | 1.6 |
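For reference, the averaging was just a per-horizon mean over runs; something like this (the numbers here are illustrative, not my actual per-run values):

```python
import numpy as np

# One row per run (best checkpoint within 20000 iterations),
# one column per horizon: 80, 160, 320, 400, 560, 1000 ms.
# Illustrative values only.
runs = np.array([
    [1.19, 1.25, 1.30, 1.32, 1.38, 1.51],
    [1.17, 1.24, 1.28, 1.31, 1.37, 1.49],
    [1.18, 1.25, 1.29, 1.31, 1.37, 1.50],
])
print(runs.mean(axis=0))  # average error per horizon over the 3 runs
```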
Best wishes,
Thanks for the update.
> In the "Residual (SA)" experiment my results are close to yours, but they still don't perfectly match on the sampling-based loss. I think this is acceptable to some extent; do you agree?
I think this makes sense. About half of the results will be better and half will be worse, and it'll be hard to perfectly match what the paper says.
This probably also reflects how small these training and validation sets are. IMO, 8 sequences for testing are way too few, but when one writes a paper one usually has to stick with what previous work has done. In this sense, reproducibility is yet another advantage of big data -- e.g., check out our work on 3d pose estimation; it has ~100K test poses, and reproducing the paper's results within +-0.5 is very, very easy.
> BTW, what did you mean by 'every optimization is different'?
I just meant that given random initialization, the end point of the optimization is likely to be different.
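If you want tighter run-to-run reproducibility, one thing you can try is pinning the random seeds. A minimal sketch, assuming TF1-style code as in this repo; note that GPU kernels and data shuffling can still introduce some nondeterminism, so this narrows rather than removes the variance:

```python
import random
import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary fixed value
random.seed(SEED)         # Python-level randomness
np.random.seed(SEED)      # numpy-based sampling/shuffling
tf.set_random_seed(SEED)  # TF1 graph-level seed (tf.random.set_seed in TF2)
```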