microsoft/UniVL

caption using features extracted from my raw video

dawnlh opened this issue · 6 comments

Hi~ sorry for bothering you again.
I have successfully finetuned the model on the caption task with the MSRVTT dataset. Following the readme in the dataloader dir, I also successfully extracted the S3D features from my own raw videos and got a pickle file. But some extra files are needed to run the script, as listed in the parameters:

DATATYPE="msrvtt"
VAL_CSV="data/msrvtt/MSRVTT_JSFUSION_test.csv"
DATA_PATH="data/msrvtt/MSRVTT_data.json"
FEATURES_PATH="data/msrvtt/msrvtt_videos_features.pickle"
INIT_MODEL="ckpts/ckpt_msrvtt_caption@server-westlakeT0528/pytorch_model.bin.4"
OUTPUT_ROOT="results"

Could you please explain how I can get the corresponding VAL_CSV and DATA_PATH files (or how to assign these parameters) to run the captioning evaluation on my extracted features? Thanks a lot!

Hi @dawnlh,
Sorry for the delayed reply. I am afraid you need to rewrite the dataloader for your video data; refer to dataloader_msrvtt_caption. For MSRVTT, the VAL_CSV is only used to get the feature dimension (1024 for S3D here). The important part is the DATA_PATH, whose format should follow data/msrvtt/MSRVTT_data.json. In other words, you need to build your own self.sentences_dict = {} and self.video_sentences_dict = defaultdict(list) in dataloader_msrvtt_caption.py. If your input is the S3D feature, any placeholder (e.g., NA) is fine for the caption.
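For reference, here is a minimal sketch of that idea (the class name `MyVideoCaptionData` is illustrative, not the one in the repo, and it assumes the feature pickle maps each video id to an S3D feature array): it loads the extracted features, assigns a placeholder caption to every video, and fills the two dictionaries the MSRVTT caption dataloader expects.

```python
# Illustrative sketch only: adapt dataloader_msrvtt_caption.py to your own feature
# pickle by filling sentences_dict / video_sentences_dict with placeholder captions.
import pickle
from collections import defaultdict

class MyVideoCaptionData:
    def __init__(self, features_path):
        # features_path: the pickle produced by the S3D feature extraction step,
        # assumed here to map video_id -> feature array of shape [n_segments, 1024].
        with open(features_path, "rb") as f:
            self.feature_dict = pickle.load(f)

        # The caption text is never used at inference time, so any placeholder works.
        self.sentences_dict = {}                       # index -> (video_id, caption)
        self.video_sentences_dict = defaultdict(list)  # video_id -> [captions]
        for idx, video_id in enumerate(sorted(self.feature_dict.keys())):
            self.sentences_dict[idx] = (video_id, "NA")
            self.video_sentences_dict[video_id].append("NA")

    def __len__(self):
        return len(self.sentences_dict)
```

The actual dataloader in the repo additionally tokenizes the caption and pads/masks the video features; the sketch above only covers the two dictionaries mentioned here.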

Hi @ArrowLuo ,
Following your detailed instructions, I have finished captioning my raw videos now~ But I find that the generated descriptions are not very accurate, as shown below:
[image: generated caption examples]
I know that the descriptions in the MSRVTT dataset are simple, which limits the accuracy. But have you studied the influence of video length (i.e., the frame count of the video) on description accuracy? Perhaps I should use shorter videos for more accurate captioning? (The videos I used above are about 60-100 frames each.) Besides, do you have any other suggestions for improving captioning accuracy on natural videos (data preparation tricks, datasets, models, anything~)? Sincerely thanks!

Hi @dawnlh,

I agree with you on that observation. The results are not very accurate, especially for the count and color of the cars. I do not think the generation will be better for shorter videos. The pretrained weights were obtained with both text and video as input, so the results will be better if some other text (e.g., a transcript) is available. Besides, the pretraining dataset may be an important factor for open-domain captioning. For video captioning there are some dedicated works, e.g., GCN on objects. For the pretraining branch, data may be more important than the model.

Hi @ArrowLuo ,
Got it~ Thanks a lot for your suggestions, they are very helpful!

Hello @dawnlh, I'm new to this field (video captioning) and I want to run the model on my own data, but I don't know how to rewrite the Dataloader for my videos. Could you share your own Dataloader code with me? Thanks a lot!

Hi @cxy990729 ~ This is my fork of UniVL with running instructions (https://github.com/Fork-for-Modify/UniVL/blob/main/run_instructions.md) (in Chinese); I hope it's helpful for you.