fartashf/vsepp

Reproducing results

MauritsBleeker opened this issue · 5 comments

Hi,

First of all, thanks for sharing this great work!

I'm having difficulties reproducing the results from the paper as a baseline. In this issue I will focus on experiment #3.15: VSE++ (ResNet), Flickr30k.

From what I gather from the paper, the config is the following:

  • 30 epochs
  • Load images from disk, no precomputed features?
  • Lower the lr after 15 epochs.
  • lr goes from 0.0002 -> 0.00002 (see the sketch below).
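
To make sure I understand that schedule, this is roughly what I have in mind; a minimal sketch in PyTorch, where the function name and defaults are my own, not necessarily the repo's exact code:

    def adjust_learning_rate(optimizer, epoch, base_lr=2e-4, lr_update=15):
        # Multiply the base lr by 0.1 every lr_update epochs,
        # e.g. 0.0002 -> 0.00002 from epoch 15 onward.
        lr = base_lr * (0.1 ** (epoch // lr_update))
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr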

My question is: is the image encoder trained end-to-end here or not? In other words, is ResNet152 used only as a fixed feature extractor, or is it optimized as well?

According to your documentation, VSE++ (and therefore, I assume, 3.14) can be reproduced using only the --max_violation flag, but I get (much) lower results. Do I need the --finetune flag as well?
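
To clarify what I mean by fixed versus finetuned, here is a rough sketch of how such a flag usually toggles the backbone; the names and the embedding size are illustrative, not the repo's actual code:

    import torch.nn as nn
    import torchvision.models as models

    def build_image_encoder(finetune=False):
        # Pretrained ResNet-152: a fixed feature extractor when
        # finetune is False, optimized end-to-end when True.
        cnn = models.resnet152(pretrained=True)
        for param in cnn.parameters():
            param.requires_grad = finetune
        # The projection into the joint embedding space is always trained
        # (its fresh parameters default to requires_grad=True).
        cnn.fc = nn.Linear(cnn.fc.in_features, 1024)
        return cnn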

Thanks,
Maurits

Hi Maurits,

I just reproduced the results for row 3.15 (Flickr30k ResNet without finetune) using the following command:

python train.py --logger_name runs/X --data_name f30k --cnn_type resnet152 --max_violation --num_epochs 30 --lr_update 15

The setup is PyTorch 1.4.0 with Python 3.7.1, using the changes from the pytorch4.1 and python3 branch. The final result as printed is:

Image to text: 43.8, 72.4, 81.8, 2.0, 13.4
Text to image: 31.6, 59.6, 69.7, 3.0, 26.6

Were you running the same command as above? How big is the gap?
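
In case it helps with debugging: --max_violation switches the ranking loss from summing over all negatives in the batch to using only the hardest negative (the MH loss in the paper). A minimal sketch of that loss, with illustrative variable names:

    import torch

    def contrastive_loss(scores, margin=0.2, max_violation=True):
        # scores: (n, n) image-caption similarity matrix; the diagonal
        # holds the scores of the matching pairs.
        diagonal = scores.diag().view(-1, 1)
        d1 = diagonal.expand_as(scores)      # positive score per image row
        d2 = diagonal.t().expand_as(scores)  # positive score per caption column
        cost_s = (margin + scores - d1).clamp(min=0)   # caption retrieval
        cost_im = (margin + scores - d2).clamp(min=0)  # image retrieval
        # Ignore the positive pairs on the diagonal.
        mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        cost_s = cost_s.masked_fill(mask, 0)
        cost_im = cost_im.masked_fill(mask, 0)
        if max_violation:
            cost_s = cost_s.max(1)[0]    # hardest negative caption per image
            cost_im = cost_im.max(0)[0]  # hardest negative image per caption
        return cost_s.sum() + cost_im.sum()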

Yeah, I used the same command.

Do you fix the random seed? I use different Python and Torch versions, but that should not make that much of a difference, right? I will share my results later today.

Thanks again,

Maurits

The seed is not fixed, sorry. Unfortunately, I had not done that in the original code, and I did not report standard deviations.
Nevertheless, I don't expect the std to be higher than 1%.
Make sure the experiment runs for the entire length of training: the recall in the first few epochs of VSE++ is near zero, but it picks up quickly.
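
If you want to control for that yourself, the usual way to fix the seeds in PyTorch looks like the sketch below; note that this is not in the repo's code:

    import random

    import numpy as np
    import torch

    def set_seed(seed=42):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Full GPU determinism may additionally require:
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False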

Okay, I've managed to reproduce the results (finally). I still don't know what the problem was in the end.

Thanks for your feedback,

Maurits

Average i2t Recall: 66.9
Image to text: 45.1 73.5 82.0 2.0 11.7

Average t2i Recall: 55.7
Text to image: 32.8 62.3 72.2 3.0 21.9
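
For anyone comparing numbers: the five values per line are R@1, R@5, R@10, median rank, and mean rank. A rough sketch of recall@K, assuming a single matching candidate per query on the diagonal (the actual Flickr30k protocol with five captions per image is slightly more involved):

    import numpy as np

    def recall_at_k(sims, k):
        # sims: (n_queries, n_candidates) similarities, with the ground-truth
        # match of query i assumed to sit at column i.
        order = np.argsort(-sims, axis=1)      # best candidate first
        gt = np.arange(sims.shape[0])[:, None]
        rank = np.argmax(order == gt, axis=1)  # 0-based rank of the ground truth
        return 100.0 * np.mean(rank < k)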

Sounds great. Thanks for reporting the result.