gorjanradevski/revisiting-spatial-temporal-layouts

Mismatch between appearance_num_frames and feature-size

Closed this issue · 7 comments

Hi Gorjan

I am running into an issue with setting the number of frames to sample from each video. To put it into context, I need to classify 1 s clips at a time, which amount to 25 frames each, so I cannot sample more than that. The current setup uses 32 frames, and I changed appearance_num_frames to 25. However, this seems to interfere with the forward_features() method of TransformerResnet: the ResNet appears to output a sequence length of 32 by default, and I am not sure whether that can be modified.

The error happens in models.py line 267, when it tries to add the position embedding. Any idea how I can rectify this? Am I interpreting appearance_num_frames correctly to begin with?

Can you post the error too?

File "{HOME}/STLT/src/modelling/models.py", line 267, in forward_features
features = features + self.pos_embed
RuntimeError: The size of tensor a (33) must match the size of tensor b (26) at non-singleton dimension 0
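For context, the off-by-one (33 vs 26) looks like a prepended class token: 32 frames coming out of the ResNet plus one, versus the 25 + 1 position embeddings after my change. Here is a minimal repro of the shape mismatch with assumed shapes and names (not the actual model code):

import torch

# Hypothetical shapes, inferred from the traceback above.
appearance_num_frames = 25          # what the config now asks for
sampled_frames = 32                 # what the sampler / ResNet still produces
dim = 768                           # embedding size, value assumed

# Learned position embedding: one slot per frame plus one for a class token.
pos_embed = torch.zeros(appearance_num_frames + 1, dim)      # (26, dim)

# Per-frame features with a class token prepended.
cls_token = torch.zeros(1, dim)
frame_features = torch.zeros(sampled_frames, dim)
features = torch.cat([cls_token, frame_features], dim=0)     # (33, dim)

features = features + pos_embed
# RuntimeError: The size of tensor a (33) must match the size of tensor b (26)
# at non-singleton dimension 0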

Also, I noticed that the sampling range deducts 2 (e.g. data_utils.py line 77). Why is that?

Finally (and sorry for being so pedantic), why the multiplication by sample_rate? Specifically, you have a comment on line 64 of data_utils.py that says 16 * 2, but the default appearance_num_frames is 32.

Thanks

Ok, I think I know what's going on. A preface first: some of the code (sample_appearance_indices, sample_train_layout_indices, get_test_layout_indices) is taken from a paper we compare with (for a fair and accurate comparison). I checked that their code is correct, but I didn't do a deep dive, hence I can't answer your question -- it's best to copy-paste the index sampling code into a Jupyter notebook and have a look at what exactly is going on.

In any case, the sample_appearance_indices method is called such that coord_nr_frames indicates how many frames you want to sample from the video (in train.py it is set from args.appearance_num_frames), and nr_video_frames is the number of frames in the video, in your case 25. Therefore, if you call frame_indices = sample_appearance_indices(32, 25, train=False), you'll get 32 frame indices sampled from your video, with some duplicates, because your video has fewer than 32 frames. How many frames to sample depends; I would suggest you tune that parameter. In my case, I just took the value from the baseline paper.
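Just to illustrate how oversampling a short clip produces duplicate indices in general (this is not the repo's sample_appearance_indices, only a generic evenly-spaced sketch):

import numpy as np

def evenly_spaced_indices(num_to_sample: int, nr_video_frames: int) -> np.ndarray:
    # Pick num_to_sample indices spread evenly over the video; when the video
    # has fewer frames than requested, some indices inevitably repeat.
    return np.linspace(0, nr_video_frames - 1, num=num_to_sample).astype(int)

print(evenly_spaced_indices(32, 25))
# 32 indices drawn from a 25-frame clip -> several indices appear twice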

Actually, it seems that sample_appearance_indices would in this case return only 25 frames (although the sampling is weird).

For my code, I am just using a simpler sampler, which returns all frames if there are fewer than args.appearance_num_frames and an ordered random choice otherwise.
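Roughly something like this (a minimal sketch of the idea, not my exact code):

import numpy as np

def sample_frame_indices(nr_video_frames: int, num_frames: int) -> np.ndarray:
    # Return all frame indices if the clip has at most num_frames frames,
    # otherwise an ordered random subset of size num_frames.
    if nr_video_frames <= num_frames:
        return np.arange(nr_video_frames)
    return np.sort(np.random.choice(nr_video_frames, size=num_frames, replace=False))

# With 1 s / 25-frame clips and appearance_num_frames = 32, this just
# returns all 25 indices in order.
print(sample_frame_indices(25, 32))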

Incidentally, I am doing some updates on my own personal branch. Let me know if you would be interested in them.

Michael

Good luck with your research! Just beware that randomly sampling frames might cause issues; I can't point to a specific reference, but it's something you might want to double-check in case you're not getting the expected results.

Thanks for the heads-up (that is why I found their code so weird, but it seems a lot of other systems do it; see e.g. the MMAction2 library).

In my case, I am just using all the frames in the 1s clip anyway.

Will keep you posted

I'm closing this issue, feel free to reopen if the problem persists.