caption my own video with provided pretrained model
dawnlh opened this issue · 8 comments
Hi, thanks for the wonderful work.
I want to caption my own videos from the video frames alone (without transcripts). Can I use the pretrained weights (univl.pretrained.bin) provided in the repository directly for this task? I evaluated the pretrained weights univl.pretrained.bin directly on MSRVTT with the following command:
DATATYPE="msrvtt"
TRAIN_CSV="data/msrvtt/MSRVTT_train.9k.csv"
VAL_CSV="data/msrvtt/MSRVTT_JSFUSION_test.csv"
DATA_PATH="data/msrvtt/MSRVTT_data.json"
FEATURES_PATH="data/msrvtt/msrvtt_videos_features.pickle"
INIT_MODEL="weight/univl.pretrained.bin"
OUTPUT_ROOT="ckpts"
python -m torch.distributed.launch --nproc_per_node=1 \
main_task_caption.py \
--do_eval --num_thread_reader=4 \
--val_csv ${VAL_CSV} \
--data_path ${DATA_PATH} \
--features_path ${FEATURES_PATH} \
--output_dir ${OUTPUT_ROOT}/ckpt_msrvtt_caption --bert_model bert-base-uncased \
--do_lower_case \
--batch_size_val 32 --visual_num_hidden_layers 6 \
--decoder_num_hidden_layers 3 --datatype ${DATATYPE} --stage_two \
--init_model ${INIT_MODEL}
but got very low metric values:
BLEU_1: 0.1410, BLEU_2: 0.0450, BLEU_3: 0.0142, BLEU_4: 0.0052
METEOR: 0.0684, ROUGE_L: 0.1229, CIDEr: 0.0045
I'm new to this field and would appreciate any suggestions, instructions, or code for using the provided pretrained model for video captioning in real-world cases. (Perhaps the main points are the pretrained model, feature extraction, and result visualization?)
Hi @dawnlh, could you provide your log.txt here? I cannot locate the problem from the command alone.
Thanks a lot! Here is the log file:
2021-05-25 11:15:57,643:INFO: Effective parameters:
2021-05-25 11:15:57,644:INFO: <<< batch_size: 256
2021-05-25 11:15:57,644:INFO: <<< batch_size_val: 32
2021-05-25 11:15:57,644:INFO: <<< bert_model: bert-base-uncased
2021-05-25 11:15:57,644:INFO: <<< cache_dir:
2021-05-25 11:15:57,644:INFO: <<< coef_lr: 0.1
2021-05-25 11:15:57,644:INFO: <<< cross_model: cross-base
2021-05-25 11:15:57,644:INFO: <<< cross_num_hidden_layers: 2
2021-05-25 11:15:57,644:INFO: <<< data_path: data/msrvtt/MSRVTT_data.json
2021-05-25 11:15:57,644:INFO: <<< datatype: msrvtt
2021-05-25 11:15:57,644:INFO: <<< decoder_model: decoder-base
2021-05-25 11:15:57,644:INFO: <<< decoder_num_hidden_layers: 3
2021-05-25 11:15:57,644:INFO: <<< do_eval: True
2021-05-25 11:15:57,644:INFO: <<< do_lower_case: True
2021-05-25 11:15:57,644:INFO: <<< do_pretrain: False
2021-05-25 11:15:57,644:INFO: <<< do_train: False
2021-05-25 11:15:57,644:INFO: <<< epochs: 20
2021-05-25 11:15:57,644:INFO: <<< feature_framerate: 1
2021-05-25 11:15:57,644:INFO: <<< features_path: data/msrvtt/msrvtt_videos_features.pickle
2021-05-25 11:15:57,644:INFO: <<< fp16: False
2021-05-25 11:15:57,644:INFO: <<< fp16_opt_level: O1
2021-05-25 11:15:57,644:INFO: <<< gradient_accumulation_steps: 1
2021-05-25 11:15:57,644:INFO: <<< hard_negative_rate: 0.5
2021-05-25 11:15:57,644:INFO: <<< init_model: weight/univl.pretrained.bin
2021-05-25 11:15:57,644:INFO: <<< local_rank: 0
2021-05-25 11:15:57,644:INFO: <<< lr: 0.0001
2021-05-25 11:15:57,644:INFO: <<< lr_decay: 0.9
2021-05-25 11:15:57,644:INFO: <<< margin: 0.1
2021-05-25 11:15:57,644:INFO: <<< max_frames: 100
2021-05-25 11:15:57,644:INFO: <<< max_words: 20
2021-05-25 11:15:57,644:INFO: <<< min_time: 5.0
2021-05-25 11:15:57,645:INFO: <<< n_display: 100
2021-05-25 11:15:57,645:INFO: <<< n_gpu: 1
2021-05-25 11:15:57,645:INFO: <<< n_pair: 1
2021-05-25 11:15:57,645:INFO: <<< negative_weighting: 1
2021-05-25 11:15:57,645:INFO: <<< num_thread_reader: 4
2021-05-25 11:15:57,645:INFO: <<< output_dir: ckpts/ckpt_msrvtt_caption
2021-05-25 11:15:57,645:INFO: <<< sampled_use_mil: False
2021-05-25 11:15:57,645:INFO: <<< seed: 42
2021-05-25 11:15:57,645:INFO: <<< stage_two: True
2021-05-25 11:15:57,645:INFO: <<< task_type: caption
2021-05-25 11:15:57,645:INFO: <<< text_num_hidden_layers: 12
2021-05-25 11:15:57,645:INFO: <<< train_csv: data/youcookii_singlef_train.csv
2021-05-25 11:15:57,645:INFO: <<< use_mil: False
2021-05-25 11:15:57,645:INFO: <<< val_csv: data/msrvtt/MSRVTT_JSFUSION_test.csv
2021-05-25 11:15:57,645:INFO: <<< video_dim: 1024
2021-05-25 11:15:57,645:INFO: <<< visual_model: visual-base
2021-05-25 11:15:57,645:INFO: <<< visual_num_hidden_layers: 6
2021-05-25 11:15:57,645:INFO: <<< warmup_proportion: 0.1
2021-05-25 11:15:57,645:INFO: <<< world_size: 1
2021-05-25 11:15:57,646:INFO: device: cuda:0 n_gpu: 1
2021-05-25 11:15:57,646:INFO: loading vocabulary file /data2/zzh/project/SCI_caption/UniVL/modules/bert-base-uncased/vocab.txt
2021-05-25 11:15:58,017:INFO: loading archive file /data2/zzh/project/SCI_caption/UniVL/modules/bert-base-uncased
2021-05-25 11:15:58,018:INFO: Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 30522
}
2021-05-25 11:15:58,018:INFO: loading archive file /data2/zzh/project/SCI_caption/UniVL/modules/visual-base
2021-05-25 11:15:58,018:INFO: Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 1,
"type_vocab_size": 2,
"vocab_size": 1024
}
2021-05-25 11:15:58,018:INFO: Weight doesn't exsits. /data2/zzh/project/SCI_caption/UniVL/modules/visual-base/visual_pytorch_model.bin
2021-05-25 11:15:58,018:INFO: loading archive file /data2/zzh/project/SCI_caption/UniVL/modules/cross-base
2021-05-25 11:15:58,018:INFO: Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 1024,
"num_attention_heads": 12,
"num_hidden_layers": 2,
"type_vocab_size": 2,
"vocab_size": 768
}
2021-05-25 11:15:58,018:INFO: Weight doesn't exsits. /data2/zzh/project/SCI_caption/UniVL/modules/cross-base/cross_pytorch_model.bin
2021-05-25 11:15:58,018:INFO: loading archive file /data2/zzh/project/SCI_caption/UniVL/modules/decoder-base
2021-05-25 11:15:58,019:INFO: Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_target_embeddings": 512,
"num_attention_heads": 12,
"num_decoder_layers": 1,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 30522
}
2021-05-25 11:15:58,019:INFO: Weight doesn't exsits. /data2/zzh/project/SCI_caption/UniVL/modules/decoder-base/decoder_pytorch_model.bin
2021-05-25 11:15:58,019:WARNING: Stage-One:False, Stage-Two:True
2021-05-25 11:15:58,019:WARNING: Set bert_config.num_hidden_layers: 12.
2021-05-25 11:15:59,122:WARNING: Set visual_config.num_hidden_layers: 6.
2021-05-25 11:15:59,591:WARNING: Set cross_config.num_hidden_layers: 2.
2021-05-25 11:15:59,763:WARNING: Set decoder_config.num_decoder_layers: 3.
2021-05-25 11:16:02,843:INFO: --------------------
2021-05-25 11:16:02,843:INFO: Weights from pretrained model not used in UniVL:
cls.predictions.bias
cls.predictions.transform.dense.weight
cls.predictions.transform.dense.bias
cls.predictions.transform.LayerNorm.weight
cls.predictions.transform.LayerNorm.bias
cls.predictions.decoder.weight
cls_visual.predictions.weight
cls_visual.predictions.bias
cls_visual.predictions.transform.dense.weight
cls_visual.predictions.transform.dense.bias
cls_visual.predictions.transform.LayerNorm.weight
cls_visual.predictions.transform.LayerNorm.bias
similarity_pooler.dense.weight
similarity_pooler.dense.bias
2021-05-25 11:16:10,136:INFO: ***** Running test *****
2021-05-25 11:16:10,136:INFO: Num examples = 2990
2021-05-25 11:16:10,136:INFO: Batch size = 32
2021-05-25 11:16:10,136:INFO: Num steps = 94
2021-05-25 11:23:31,867:INFO: >>> BLEU_1: 0.1410, BLEU_2: 0.0450, BLEU_3: 0.0142, BLEU_4: 0.0052
2021-05-25 11:23:31,877:INFO: >>> METEOR: 0.0684, ROUGE_L: 0.1229, CIDEr: 0.0045
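A quick way to read a log like the one above is to pull out the run-mode settings and the metric lines. The sketch below assumes the log was saved as log.txt in the format shown (settings prefixed with `<<<`, metrics prefixed with `>>>`); the function name `summarize_run` is made up for illustration and is not part of the repository.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not part of the repository): print the run-mode
# settings and the final metrics from a log in the format shown above.
summarize_run() {
  log_file="$1"
  # Settings that show whether the run trained or only evaluated.
  grep -E '<<< (do_train|do_eval|init_model|stage_two):' "$log_file"
  # Metric lines are prefixed with ">>>"; drop the timestamp for readability.
  grep '>>>' "$log_file" | sed -E 's/^.*>>> //'
}

# Usage: summarize_run log.txt
```

For the log in this thread, the settings lines would include `do_train: False` and `do_eval: True`, which means the scores come from the pretrained checkpoint without any fine-tuning.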
Hi @dawnlh, I suppose that you evaluated the pretrained weights directly (zero-shot) instead of fine-tuning. You should fine-tune with --do_train first.
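As a rough sketch of what that could look like, the evaluation command from the original post can be switched to training by replacing --do_eval with --do_train and passing the training CSV. Every flag used here appears either in that command or in the "Effective parameters" section of the log, but the values for --epochs, --batch_size and --lr are untested placeholders, and the `run` and `finetune` wrappers are hypothetical names added only so the command can be printed before it is launched.

```shell
#!/usr/bin/env sh
# Sketch only: the evaluation command from this thread with --do_train added.
# The values for --epochs, --batch_size and --lr are placeholders, not
# recommended settings.
DATATYPE="msrvtt"
TRAIN_CSV="data/msrvtt/MSRVTT_train.9k.csv"
VAL_CSV="data/msrvtt/MSRVTT_JSFUSION_test.csv"
DATA_PATH="data/msrvtt/MSRVTT_data.json"
FEATURES_PATH="data/msrvtt/msrvtt_videos_features.pickle"
INIT_MODEL="weight/univl.pretrained.bin"
OUTPUT_ROOT="ckpts"

# Hypothetical wrapper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s ' "$@"; echo
  else
    "$@"
  fi
}

finetune() {
  run python -m torch.distributed.launch --nproc_per_node=1 \
    main_task_caption.py \
    --do_train --num_thread_reader=4 \
    --epochs 5 --batch_size 32 --lr 3e-5 \
    --train_csv "${TRAIN_CSV}" \
    --val_csv "${VAL_CSV}" \
    --data_path "${DATA_PATH}" \
    --features_path "${FEATURES_PATH}" \
    --output_dir "${OUTPUT_ROOT}/ckpt_msrvtt_caption" --bert_model bert-base-uncased \
    --do_lower_case \
    --batch_size_val 32 --visual_num_hidden_layers 6 \
    --decoder_num_hidden_layers 3 --datatype "${DATATYPE}" --stage_two \
    --init_model "${INIT_MODEL}"
}

# Prints the command by default; use DRY_RUN=0 to actually launch it.
finetune
```

Printing first makes it easy to check paths and flags before committing GPU time; setting DRY_RUN=0 in the environment runs the same command for real.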
Yes, I evaluated the pretrained weights directly (zero-shot). I tried to fine-tune the model but failed due to limited GPU memory (even with batch_size set to 1). Could you estimate how much GPU memory is needed to fine-tune the model? Or would it be possible to share the fine-tuned weights for the captioning task (no transcript)?
Hi @dawnlh. We fine-tuned the model with 4 Tesla V100 GPUs. I am sorry that we cannot provide the fine-tuned weights.
Okay, thanks anyway~ I'll try to work around the GPU limitation. One more question: could you provide some instructions or code for using the fine-tuned model to caption self-captured videos? I mean the input video processing (how to extract the same features as the training set to serve as the model input) and the output visualization.