Commands to MQ Training with VSGN
JunweiLiang opened this issue · 10 comments
Hi, thanks for releasing the code!
Could you provide some instructions on how to run VSGN training with EgoVLP features (hyper-parameters, learning rate, etc.)? Thanks!
Junwei
Hello Junwei,
Thanks for your interest in our work.
I will update the instructions and related details for MQ next.
Thank you for your patience!
Hi Junwei,
I have uploaded the video features for the MQ task to Google Drive (train&val / test), so you can download them directly.
All you need to do is replace the input features with ours.
I have also attached the config of our best VSGN model here: config.txt.
Please try it out and let us know if you get new results.
I have downloaded the features, but each download seems to be a single file. Is it a single pickle binary with dictionary keys? How do I read the features and map them to the videos (for example, slowfast8x8_r101_k400/ has 9645 *.pt files, each corresponding to a video)?
Thanks.
There is a gz file. After unzipping it (I unzipped it on my Mac), you will see a folder that contains multiple *.pt files, e.g., 0a8f6747-7f79-4176-85ca-f5ec01a15435.pt. This .pt file holds the video features of the clip 0a8f6747-7f79-4176-85ca-f5ec01a15435.
The clip information is provided by the MQ metadata, i.e., clip xxx comes from video yyy with start time t1 and end time t2.
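As a rough illustration of the clip-to-video mapping described above, here is a minimal sketch that builds a lookup table from clip ID to (video ID, t1, t2). The metadata layout (a "videos" list whose entries hold a "clips" list) and the field names video_uid, clip_uid, video_start_sec and video_end_sec are assumptions made for this example, as is the helper name build_clip_index; check them against the actual MQ annotation files before relying on them.

```python
def build_clip_index(metadata):
    """Return {clip_uid: (video_uid, start_sec, end_sec)}.

    Assumes a hypothetical layout: metadata["videos"] is a list of videos,
    each with a "video_uid" and a list of "clips".
    """
    index = {}
    for video in metadata["videos"]:
        for clip in video["clips"]:
            index[clip["clip_uid"]] = (
                video["video_uid"],
                clip["video_start_sec"],
                clip["video_end_sec"],
            )
    return index


# Toy metadata using the placeholder names from the comment above.
example_metadata = {
    "videos": [
        {
            "video_uid": "yyy",
            "clips": [
                {"clip_uid": "xxx", "video_start_sec": 10.0, "video_end_sec": 490.0},
            ],
        }
    ]
}

clip_index = build_clip_index(example_metadata)
print(clip_index["xxx"])
```

With such an index, the name of each feature file (minus the .pt suffix) can be looked up directly to find the source video and time span.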
I see. The file you provided on Google Drive is a .tar.gz file; I extracted it with tar -zxf and got 2034 *.pt files for the train/val part. I will try them.
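For anyone who prefers to do the extraction from Python, the following is a small sketch using only the standard library. It unpacks a .tar.gz archive and indexes the resulting *.pt files by clip ID, assuming each file is named <clip_id>.pt as described above. The function name extract_and_index and the paths are hypothetical; reading the tensors themselves (presumably with torch.load) is left out so the sketch has no third-party dependencies.

```python
import tarfile
from pathlib import Path


def extract_and_index(archive_path, out_dir):
    """Extract a .tar.gz archive and return {clip_id: path_to_pt_file}.

    Only extract archives from a source you trust.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(out_dir)
    # The clip ID is the file name without the .pt suffix.
    return {p.stem: p for p in sorted(out_dir.rglob("*.pt"))}


# Hypothetical usage (paths are placeholders):
# index = extract_and_index("features_train_val.tar.gz", "data/egovlp_feats_official")
# print(len(index))
```

The length of the returned dictionary gives a quick check against the expected file count.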
So 0a8f6747-7f79-4176-85ca-f5ec01a15435 is the clip ID rather than the video ID? Could you provide feature files for the whole videos, as used by the VSGN baseline? The baseline reads the features of the whole video and then cuts out the corresponding clip (see here). To follow your instructions I would need these video-level features.
Thanks.
Yes, it is the clip ID. Sorry, I am currently unable to provide video-level features. One solution is to rewrite the data loader so that it supports clip features as input.
@QinghongLin - Thanks for providing the clip features. I tried training the VSGN model using the Ego4D episodic-memory codebase instructions. But I'm not able to reproduce the val results from the paper. The numbers are quite a bit lower than the paper results (2nd row vs. 3rd row in the figure below).
Here is the training command I used. Note: I modified the data loader to use clip features instead of video features.
python Train.py \
--use_xGPN \
--is_train true \
--dataset ego4d \
--feature_path data/egovlp_feats_official \
--checkpoint_path checkpoints/ \
--tb_dir tb/ \
--batch_size 24 \
--train_lr 0.00005 \
--use_clip_features true \
--input_feat_dim 256 \
--num_epoch 100
Hi, @srama2512 ,
I have released the codebase here: MQ.zip. You can check its data loader for details on clip-level feature loading.
I was also able to check the config parameters. Could you try the following ones?
{'dataset': 'ego4d', 'is_train': 'true', 'out_prop_map': 'true', 'feature_path': '/mnt/sdb1/Datasets/Ego4d/action_feature_canonical', 'clip_anno': 'Evaluation/ego4d/annot/clip_annotations.json', 'moment_classes': 'Evaluation/ego4d/annot/moment_classes_idx.json', 'checkpoint_path': 'checkpoint', 'output_path': './outputs/hps_search_egovlp_egonce_features/23/', 'prop_path': 'proposals', 'prop_result_file': 'proposals_postNMS.json', 'detect_result_file': 'detections_postNMS.json', 'retrieval_result_file': 'retreival_postNMS.json', 'detad_sensitivity_file': 'detad_sensitivity', 'batch_size': 32, 'train_lr': 5e-05, 'weight_decay': 0.0001, 'num_epoch': 50, 'step_size': 15, 'step_gamma': 0.1, 'focal_alpha': 0.01, 'nms_alpha_detect': 0.46, 'nms_alpha_prop': 0.75, 'nms_thr': 0.4, 'temporal_scale': 928, 'input_feat_dim': 2304, 'bb_hidden_dim': 256, 'decoder_num_classes': 111, 'num_levels': 5, 'num_head_layers': 4, 'nfeat_mode': 'feat_ctr', 'num_neigh': 12, 'edge_weight': 'false', 'agg_type': 'max', 'gcn_insert': 'par', 'iou_thr': [0.5, 0.5, 0.7], 'anchor_scale': [1, 10], 'base_stride': 1, 'stitch_gap': 30, 'short_ratio': 0.4, 'clip_win_size': 0.38, 'use_xGPN': False, 'use_VSS': False, 'num_props': 200, 'tIoU_thr': [0.1, 0.2, 0.3, 0.4, 0.5], 'eval_stage': 'all', 'infer_datasplit': 'val'}
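To make the learning-rate settings in this config easier to read, here is a small sketch of the schedule they would produce, assuming that step_size and step_gamma drive a standard step decay (multiply the rate by step_gamma every step_size epochs, as torch.optim.lr_scheduler.StepLR does). That interpretation is an assumption based on the parameter names, not something confirmed in this thread, and lr_at_epoch is a hypothetical helper.

```python
def lr_at_epoch(epoch, base_lr=5e-05, step_size=15, gamma=0.1):
    """Learning rate under step decay: base_lr * gamma ** (epoch // step_size)."""
    return base_lr * gamma ** (epoch // step_size)


# Values taken from the config above: train_lr=5e-05, step_size=15,
# step_gamma=0.1, num_epoch=50.
for epoch in (0, 14, 15, 30, 45, 49):
    print(f"epoch {epoch:2d}: lr = {lr_at_epoch(epoch):.1e}")
```

Under this reading, the rate stays at 5e-05 for the first 15 epochs and drops by a factor of 10 at epochs 15, 30 and 45.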
@QinghongLin - Thanks for sharing your code and the hyperparameters. I was able to obtain similar performance. It turns out that there was a bug in the test_mq.py feature-extraction code that I used. I modified test_mq.py to increase the batch size here to 128.
Lines 77 to 87 in dc4a60f
The calculation times = data['video'].shape[0] // batch does not work when data['video'].shape[0] is not a multiple of batch: the floor division drops the final partial batch, leaving a residual set of all-zero features at the end. The problem gets much worse as batch increases. After changing that part of the code to the snippet below, it works as expected.
if data['video'].shape[0] % batch == 0:
    times = data['video'].shape[0] // batch
else:
    times = data['video'].shape[0] // batch + 1
Happy to send a PR if you'd like this bug fix to be part of the EgoVLP repo. This affects most of the test_*.py scripts and causes a significant issue if anyone increases batch.
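For reference, here is a standalone sketch of the batching issue with plain integers instead of tensors. It compares floor division with ceiling division and lists the index ranges each batch would cover. The helper names and the example sizes (300 items, batch of 128) are made up for illustration and are not taken from test_mq.py.

```python
def num_batches_floor(n, batch):
    """Floor division: silently drops the final partial batch."""
    return n // batch


def num_batches_ceil(n, batch):
    """Ceiling division: adds one extra batch when n is not a multiple of batch."""
    return (n + batch - 1) // batch


def batch_slices(n, batch):
    """Return (start, end) index pairs that together cover all n items."""
    return [
        (i * batch, min((i + 1) * batch, n))
        for i in range(num_batches_ceil(n, batch))
    ]


n, batch = 300, 128
covered = num_batches_floor(n, batch) * batch
print("floor:", num_batches_floor(n, batch), "batches, items left out:", n - covered)
print("ceil: ", num_batches_ceil(n, batch), "batches")
print("slices:", batch_slices(n, batch))
```

The ceiling form is equivalent to the if/else fix above, and the number of skipped items under floor division can be as large as batch - 1, which is why a larger batch makes the problem more visible.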