Can't reproduce the paper's results.
CHELSEA234 opened this issue · 7 comments
Hi, I like your work and am trying to build my own research on top of your idea, but I simply couldn't reproduce your paper's results.
Here is what I have done:
python3 src/translate.py --action walking --seq_length_out 25
python3 src/translate.py --residual_velocities --action walking
What I got:
Aside from plausible animations for each action, the following tables summarize my experiments.
'Long term' is the sampling-based loss (SA) result from my experiment, 'YOUR WORK' is your reported result, and the last row is the SRNN paper's motion-forecasting error.
| Walking \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| Long term | 1.004 | 1.190 | 1.473 | 1.594 | 1.794 | 2.027 |
| YOUR WORK | 0.92 | 0.98 | 1.02 | 1.20 | --- | --- |
| SRNN paper | 1.08 | 1.34 | 1.60 | --- | 1.90 | 2.13 |

| Eating \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| Long term | 1.195 | 1.473 | 1.998 | 2.184 | 2.316 | 2.336 |
| YOUR WORK | 0.98 | 0.99 | 1.18 | 1.31 | --- | --- |
| SRNN paper | 1.35 | 1.71 | 2.12 | --- | 2.28 | 2.58 |

| Smoking \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| Long term | 1.282 | 1.572 | 2.486 | 2.609 | 3.258 | 2.861 |
| YOUR WORK | 1.38 | 1.39 | 1.56 | 1.65 | --- | --- |
| SRNN paper | 1.90 | 2.30 | 2.90 | --- | 3.21 | 3.23 |

| Discussion \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| Long term | 1.605 | 1.986 | 2.513 | 2.702 | 3.087 | 3.187 |
| YOUR WORK | 1.78 | 1.80 | 1.83 | 1.90 | --- | --- |
| SRNN paper | 1.67 | 2.03 | 2.20 | --- | 2.39 | 2.43 |
I am puzzled about two things:

- According to your results, I suppose my numbers are off, though they look tolerable compared with the SRNN paper's. Can you give me some advice on correcting my setup?
- I suspect that ~1e5 iterations is too many, because I noticed that the error grows as the iteration count increases.

Looking forward to your reply. Many thanks!
=====================================
UPDATE
(boldface marks the better of my two checkpoints in each column)
| Walking \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| 1e4 iterations | 1.306 | 1.360 | 1.362 | 1.380 | **1.381** | **1.488** |
| 2e4 iterations | **1.195** | **1.276** | **1.318** | **1.345** | 1.401 | 1.554 |
| YOUR WORK | 0.92 | 0.98 | 1.02 | 1.20 | --- | --- |

| Eating \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| 1e4 iterations | 1.126 | 1.189 | **1.300** | **1.380** | **1.507** | **1.752** |
| 2e4 iterations | **1.043** | **1.162** | 1.379 | 1.497 | 1.674 | 2.036 |
| YOUR WORK | 0.98 | 0.99 | 1.18 | 1.31 | --- | --- |

| Smoking \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| 1e4 iterations | 1.514 | 1.597 | 1.752 | 1.789 | 1.862 | 2.257 |
| 2e4 iterations | **1.238** | **1.357** | **1.593** | **1.640** | **1.738** | **2.196** |
| YOUR WORK | 1.38 | 1.39 | 1.56 | 1.65 | --- | --- |

| Discussion \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| 1e4 iterations | 1.682 | 1.803 | 1.847 | 1.825 | 1.952 | **2.185** |
| 2e4 iterations | **1.439** | **1.603** | **1.710** | **1.728** | **1.938** | 2.196 |
| YOUR WORK | 1.78 | 1.80 | 1.83 | 1.90 | --- | --- |
Taking your advice, I checked the results at the 10000th and 20000th iterations, and they improved. Thanks!! I suppose the 20000th iteration is the better choice for the sampling-based-loss experiment, but gaps remain, especially for the walking and eating actions. Is this normal?
Really sorry to bother you: is the code on GitHub your final version, or just a demo? If the latter, what changes should I make? BTW, is the iteration count uniform across all experiments? For example, do the Seq2seq architecture, sampling-based loss (SA), and residual architecture (Residual (SA)) all use 10000 iterations?
Hi! Thanks for reporting this.
Sorry, what are the `Long term (one-hot)` and `Short term (one-hot)` results from your post?
Indeed, 1e5 iterations seems a bit too many. I'd suggest using 1e4, as in the demos; see the example below. Please let me know if that doesn't work and I can have a closer look tomorrow.
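For example, something like this (assuming the `--iterations` flag in src/translate.py; double-check the flag name there):
python3 src/translate.py --residual_velocities --action walking --iterations 10000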
Also, what result in our paper are you referring to? We reported multiple models and baselines.
Finally, re: 4 hours: what hardware are you using? I just ran your second command and it took <5 minutes for 1e4 iterations on a machine with a Titan Xp.
Hi, thanks for your patience. I've trimmed some unimportant details and rephrased my question above.
I used a small iteration count and would like your advice on how to get better results.
Thanks @una-dinosauria
Hi @CHELSEA234.
- Please don't edit your posts above; it makes it hard for future readers to follow the conversation. I'd appreciate it if you could put additional questions and information in new messages.
- It makes sense that the results don't perfectly match those of the paper -- there is random initialization, and every optimization run is different. We reported averages over multiple (5, if I remember correctly) experiments.
- Since single-action (SA) experiments have small yet variable amounts of data, it is hard to find a number of iterations that works for all actions, as some models will overfit faster than others. A common practice is to train for a large number of iterations and simply keep the model that performed best on a validation set. You can keep track of validation results on TensorBoard and simply pick the best number you see there; a minimal sketch of the idea follows below.
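Roughly, the loop looks like this (`train_step` and `validation_error` are hypothetical stand-ins for the corresponding code in src/translate.py, not its actual API):

```python
import numpy as np

def train_step():
    """Hypothetical stand-in for one optimization step in src/translate.py."""

def validation_error():
    """Hypothetical stand-in: mean Euler-angle error on the validation set."""
    return np.random.rand()  # placeholder so the sketch runs

best_error, best_step = np.inf, 0
for step in range(1, 20001):
    train_step()
    if step % 1000 == 0:      # validate periodically
        err = validation_error()
        if err < best_error:  # keep only the best-so-far checkpoint
            best_error, best_step = err, step
            # this is where you would save the model, e.g. with tf.train.Saver
print("best validation error %.3f at iteration %d" % (best_error, best_step))
```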
Cheers,
Thanks for your patient and detailed reply!

- I will pay attention to this from now on and won't edit my comments again.
- Following your suggestions (points 2 and 3 in your last comment), I averaged over 3 experiments, recording each experiment's best result within the first 20000 iterations (a rough sketch of the averaging is below the tables). In the "Residual (SA)" experiment my results are close to yours, but they still don't perfectly match on the sampling-based loss.

I think this is acceptable to some extent; do you agree? BTW, what did you mean by 'every optimization is different'?
Sampling-based loss (SA):
| Walking \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| My average | 1.180 | 1.245 | 1.29 | 1.314 | 1.373 | 1.501 |
| YOUR WORK | 0.92 | 0.98 | 1.02 | 1.20 | --- | --- |

| Eating \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| My average | 1.01 | 1.111 | 1.312 | 1.431 | 1.613 | 1.978 |
| YOUR WORK | 0.98 | 0.99 | 1.18 | 1.31 | --- | --- |

| Smoking \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| My average | 1.23 | 1.35 | 1.60 | 1.65 | 1.74 | 2.19 |
| YOUR WORK | 1.38 | 1.39 | 1.56 | 1.65 | --- | --- |

| Discussion \ time (ms) | 80 | 160 | 320 | 400 | 560 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| My average | 1.432 | 1.60 | 1.69 | 1.70 | 1.93 | 2.19 |
| YOUR WORK | 1.78 | 1.80 | 1.83 | 1.90 | --- | --- |
Residual (SA):
| Walking \ time (ms) | 80 | 160 | 320 | 400 |
| --- | --- | --- | --- | --- |
| My result | 0.365 | 0.619 | 0.886 | 0.991 |
| YOUR WORK | 0.34 | 0.6 | 0.95 | 1.09 |

| Eating \ time (ms) | 80 | 160 | 320 | 400 |
| --- | --- | --- | --- | --- |
| My result | 0.292 | 0.523 | 0.919 | 1.116 |
| YOUR WORK | 0.3 | 0.53 | 0.92 | 1.13 |

| Smoking \ time (ms) | 80 | 160 | 320 | 400 |
| --- | --- | --- | --- | --- |
| My result | 0.36 | 0.666 | 1.219 | 1.321 |
| YOUR WORK | 0.36 | 0.66 | 1.17 | 1.27 |

| Discussion \ time (ms) | 80 | 160 | 320 | 400 |
| --- | --- | --- | --- | --- |
| My result | 0.418 | 0.886 | 1.336 | 1.439 |
| YOUR WORK | 0.44 | 0.93 | 1.45 | 1.6 |
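For reference, the averaging was just a per-horizon mean over runs; something like this (the numbers here are illustrative, not my actual per-run values):

```python
import numpy as np

# One row per run (best checkpoint within 20000 iterations),
# one column per horizon: 80, 160, 320, 400, 560, 1000 ms.
# Illustrative values only.
runs = np.array([
    [1.19, 1.25, 1.30, 1.32, 1.38, 1.51],
    [1.17, 1.24, 1.28, 1.31, 1.37, 1.49],
    [1.18, 1.25, 1.29, 1.31, 1.37, 1.50],
])
print(runs.mean(axis=0))  # average error per horizon over the 3 runs
```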
Best wishes,
Thanks for the update.
> In the "Residual (SA)" experiment my results are close to yours, but they still don't perfectly match on the sampling-based loss. I think this is acceptable to some extent; do you agree?
I think this makes sense. About half of the results will be better and half will be worse, and it'll be hard to perfectly match what the paper says.
This probably also reflects how small these training and validation sets are. IMO, 8 sequences for testing are way too few, but when one writes a paper one usually has to stick with what previous work has done. In this sense, reproducibility is yet another advantage of big data -- e.g., check out our work on 3d pose estimation; it has ~100K test poses, and reproducing the paper's results within +-0.5 is very, very easy.
> BTW, what did you mean by 'every optimization is different'?
I just meant that given random initialization, the end point of the optimization is likely to be different.
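If you want tighter run-to-run reproducibility, one thing you can try is pinning the random seeds. A minimal sketch, assuming TF1-style code as in this repo; note that GPU kernels and data shuffling can still introduce some nondeterminism, so this narrows rather than removes the variance:

```python
import random
import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary fixed value
random.seed(SEED)         # Python-level randomness
np.random.seed(SEED)      # numpy-based sampling/shuffling
tf.set_random_seed(SEED)  # TF1 graph-level seed (tf.random.set_seed in TF2)
```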