Problem reproducing the image captioning results
chenwq95 opened this issue · 5 comments
Hi,
Thank you for your great work.
I'm trying to reproduce the image captioning results with the following steps:
- Download the Karpathy splits of COCO and run "scripts/prepro_labels.py" to prepare the data (the exact invocation I used is shown at the end of this comment).
- Download the Bottom-up and VC features from your link.
- Train the model with the cross entropy loss:
"python train.py --id topdown --caption_model topdown --input_json data/cocotalk.json --input_label_h5 data/cocotalk_label.h5 --input_att_dir_vc [the/path/to/VC_Feature/trainval] --input_att_dir [the/path/to/Updown_Feature] --batch_size 50 --learning_rate 3e-4 --checkpoint_path log_topdown --save_checkpoint_every 2200 --val_images_use 5000 --rnn_size 2048 --input_encoding_size 1024 --max_epochs 30 --language_eval 1"
- Evaluate the model with the code:
python eval.py --model log_topdown/model-best.pth --infos_path log_topdown/infos_topdown-best.pkl --dump_images 0 --num_images -1 --language_eval 1 --beam_size 2 --batch_size 50 --split test
- The results are:
{'Bleu_1': 0.7625701835246635, 'Bleu_2': 0.6021042790224688, 'Bleu_3': 0.46398074453035226, 'Bleu_4': 0.35592428819070027, 'METEOR': 0.27917788348120276, 'ROUGE_L': 0.566515050577319, 'CIDEr': 1.136820918673527, 'bad_count_rate': 0.0014}
which are much lower than the reported results.
So my question is: are there any important settings I am missing to reproduce the reported results?
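For completeness, the label-preprocessing command I ran was the standard one from the base captioning code (paths are mine, and the exact flags may differ in this repo, so please correct me if something should change):

python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk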
My bad. It seems that self-critical training is important. I'm now running the code with CIDEr-D score optimization and will report my results later.
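For context, my understanding of the self-critical (SCST) objective being optimized is roughly the following; this is a minimal sketch of the idea, not this repo's exact implementation, and the CIDEr-D rewards would come from scoring sampled vs. greedy captions against the references:

import torch

def scst_loss(sample_logprobs, sample_reward, greedy_reward, mask):
    # Advantage = reward of the sampled caption minus the greedy-decoding baseline.
    advantage = (sample_reward - greedy_reward).view(-1, 1)  # (batch, 1)
    # Push up log-probs of samples that beat the baseline, push down the rest.
    return -(advantage * sample_logprobs * mask).sum() / mask.sum()

# Toy usage with random numbers (batch of 2 captions, length 5):
sample_logprobs = torch.randn(2, 5)        # per-token log-probs of sampled captions
sample_reward = torch.tensor([1.2, 0.8])   # CIDEr-D of sampled captions
greedy_reward = torch.tensor([1.0, 1.0])   # CIDEr-D of greedy baseline captions
mask = torch.ones(2, 5)                    # token mask (1 for real tokens)
print(scst_loss(sample_logprobs, sample_reward, greedy_reward, mask))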
Hi, sorry for the delayed reply. Yeah, we follow the previous methods with self-critical training.
I actually haven't tried the CIDEr-D optimization myself, but your results are indeed too low, as you said. For reference, you can check the original up-down model's performance. Using the VC feature definitely gives better performance in both stages (supervised and self-critical).
To reproduce our results, it may be better to first run the code following the default flow.
If you have any other problems, please feel free to contact me.
Best
Tan
Finally, I got the results after self-critical training:
{'Bleu_1': 0.8092960250603055, 'Bleu_2': 0.6543049978813428, 'Bleu_3': 0.5084688113516954, 'Bleu_4': 0.38850210505433325, 'METEOR': 0.2887807245138986, 'ROUGE_L': 0.5896889033565031, 'CIDEr': 1.2924082453621994, 'bad_count_rate': 0.0004}
after running the code:
python train.py --id topdown --caption_model topdown --input_json data/cocotalk.json --input_label_h5 data/cocotalk_label.h5 --input_att_dir_vc ../../../Seq2Seq_Transformer/data/mscoco_vc_features/vc_coco_trainval_2014 --input_att_dir ../../../Seq2Seq_Transformer/data/mscoco_bottom_up_features_version2 --batch_size 50 --learning_rate 3e-4 --checkpoint_path log_topdown_lr_3 --save_checkpoint_every 2200 --val_images_use 5000 --max_epochs 80 --rnn_size 2048 --input_encoding_size 1024 --self_critical_after 30 --language_eval 1 --learning_rate_decay_start 0 --scheduled_sampling_start 0 >log_train_mix.out
Hi, thanks for your experiment and feedback.
First of all, I have checked my results again: the final CIDEr score on the Karpathy test split is indeed about 130.5 (best epoch 77), and the best validation score is about 128.4.
So I am trying to figure out whether something went wrong.
Did you run the test command to get the results on the test split? And did you follow the command in the repo exactly?
Hi. Sincere thanks for your time and detailed reply.
Yes. I followed the command on the downstream page and tested the model by running:
python eval.py --model log_topdown/model-best.pth --infos_path log_topdown/infos_topdown-best.pkl --dump_images 0 --num_images -1 --language_eval 1 --beam_size 2 --batch_size 50 --split test
Is it possible that the config --self_critical_after 30 needs to be larger? Training for only 30 epochs with the cross-entropy loss may not yield strong enough results, which could affect the second training stage.
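For reference, my understanding is that this flag only controls the epoch at which training switches from the cross-entropy loss to self-critical CIDEr-D optimization, roughly like this (a sketch assuming the usual self-critical.pytorch-style training loop; the names here are illustrative, not the repo's actual variables):

def choose_objective(epoch, self_critical_after):
    # Before the switch epoch, train with cross-entropy (teacher forcing);
    # from that epoch on, train with the self-critical CIDEr-D reward.
    use_scst = self_critical_after != -1 and epoch >= self_critical_after
    return "scst (CIDEr-D reward)" if use_scst else "cross-entropy"

for epoch in [0, 29, 30, 79]:
    print(epoch, "->", choose_objective(epoch, self_critical_after=30))

If the flag works this way, raising it would simply give the model more cross-entropy epochs before the switch.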