Reproducing results
MauritsBleeker opened this issue · 5 comments
Hi,
First of all, thanks for sharing this great work!
I'm having difficulty reproducing the results from the paper as a baseline. This issue concerns experiment #3.15: VSE++ (ResNet), Flickr30k.
From what I gather from the paper, the config is the following:
- 30 epochs
- Load images from disk, no precomputed features?
- Lower the lr after 15 epochs.
- lr goes from 0.0002 -> 0.00002
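For concreteness, this is my reading of that schedule as a sketch (the function name and signature are mine, not necessarily what train.py does): a 10x drop at epoch 15.

```python
def adjust_learning_rate(optimizer, epoch, base_lr=0.0002, lr_update=15):
    """Decay the lr by 10x every `lr_update` epochs: 0.0002 -> 0.00002 at epoch 15."""
    lr = base_lr * (0.1 ** (epoch // lr_update))
    for group in optimizer.param_groups:
        group['lr'] = lr
    return lr
```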
My question is: is the image encoder trained end-to-end here? In other words, is ResNet152 used only as a fixed feature extractor, or is it optimized as well?
According to your documentation, VSE++ (and therefore, I assume, 3.14) can be reproduced by using only the --max_violation flag, but I get (way) lower results. Do I need the --finetune flag as well?
Thanks,
Maurits
Hi Maurits,
I just reproduced the results for row 3.15 (Flickr30k ResNet without finetune) using the following command:
python train.py --logger_name runs/X --data_name f30k --cnn_type resnet152 --max_violation --num_epochs 30 --lr_update 15
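For anyone else reading: --max_violation replaces the sum over negatives in the ranking loss with the hardest negative, as described in the VSE++ paper. A minimal sketch of that loss, assuming a square image-caption score matrix (the function name is mine, not the repo's):

```python
import numpy as np

def contrastive_loss(scores, margin=0.2, max_violation=True):
    """Ranking loss over an (n, n) image-caption score matrix.

    scores[i, j] is the similarity between image i and caption j;
    the diagonal holds the matching (positive) pairs.
    """
    n = scores.shape[0]
    diag = np.diag(scores).reshape(n, 1)
    # hinge cost of every caption vs. the matched caption (per image)
    cost_s = np.clip(margin + scores - diag, 0, None)
    # hinge cost of every image vs. the matched image (per caption)
    cost_im = np.clip(margin + scores - diag.T, 0, None)
    # zero out the positive pairs on the diagonal
    np.fill_diagonal(cost_s, 0)
    np.fill_diagonal(cost_im, 0)
    if max_violation:
        # VSE++: keep only the hardest negative per row/column
        cost_s = cost_s.max(axis=1)
        cost_im = cost_im.max(axis=0)
    return cost_s.sum() + cost_im.sum()
```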
The setup is PyTorch 1.4.0 and Python 3.7.1, and I used the changes in the pytorch4.1 and python3 branch. The final result as printed is:
Image to text: 43.8, 72.4, 81.8, 2.0, 13.4
Text to image: 31.6, 59.6, 69.7, 3.0, 26.6
Were you running the same command as above? How big is the gap?
Yeah, I used the same command.
Do you use a random seed? I'm using different Python and Torch versions, but that shouldn't make that much of a difference, right? I'll share my results later today.
Thanks again,
Maurits
Sorry, the seed is not fixed. Unfortunately, I did not do that in the original code and did not report standard deviations.
Nevertheless, I don't expect the std to be higher than 1%.
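If you want to pin the run down yourself, a typical seed-fixing helper for PyTorch training scripts looks like this (a sketch, not code from this repo; the function name is mine):

```python
import random

import numpy as np

def set_seed(seed=42):
    """Fix the RNG sources a train.py-style script typically touches."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines
        # cuDNN otherwise autotunes non-deterministic kernels
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # torch not installed; only the stdlib/numpy RNGs get seeded
    return seed
```

Note that even with all seeds fixed, some CUDA ops are non-deterministic, so small run-to-run variation can remain.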
Make sure the experiment runs for the entire length of training. The recall at the end of the first few epochs of VSE++ is near zero, but it picks up quickly.
Okay, I've managed to reproduce the results (finally). I still don't know what the problem was in the end.
Thanks for your feedback,
Maurits
Average i2t Recall: 66.9
Image to text: 45.1 73.5 82.0 2.0 11.7
Average t2i Recall: 55.7
Text to image: 32.8 62.3 72.2 3.0 21.9
Sounds great. Thanks for reporting the result.