declare-lab/MM-Align

number of frames per utterance for MELD visual features

dondongwon opened this issue · 4 comments

Hi, thanks for your work.

I've been checking out the feature shape and it seems that for dialogue 1 utterance 1 there are only 6 frames (which means that it is 6/25 of a second), which doesn't make sense.

image

Is the frame rate of 25 fps accurate?

It seems that what you are checking is the tokenization IDs, so 6 tokens does not mean the duration is 6/25 sec. The frame rate only applies to audio.

The last line in the image is the shape of the video_features, the ResNet-101 features that were used, which seems to be (6, 2048). ResNet extracts features per image frame, which means that only 6 frames were extracted for the 1st utterance of the 1st dialogue.

More clarification would be appreciated. Thanks!

I see. These are pre-aligned features, obtained by average-pooling the 25 fps audio/video features down to the same length as the text token IDs. So a shape of (6, 2048) means the utterance has 6 text tokens, not 6 video frames. This preprocessing only aims to reduce the computational cost, since OT (optimal transport) is very slow and the original audio/video sequences are too long.
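For anyone landing here later, this is a minimal sketch of what such length-matching average pooling could look like. It is not the repository's actual preprocessing code: the function name `average_pool_to_length`, the choice of near-equal contiguous segments, and the toy 50-frame input are all assumptions made for illustration. It uses plain Python lists to stay dependency-free; in practice you would likely do this with NumPy or PyTorch tensors.

```python
def average_pool_to_length(frames, target_len):
    """Average-pool per-frame feature vectors down to `target_len` vectors.

    `frames` is a list of equal-length feature vectors (one per frame).
    The frames are split into `target_len` contiguous, near-equal segments
    (an assumed scheme, not necessarily the one used by MM-Align), and each
    segment is replaced by its element-wise mean.
    """
    num_frames = len(frames)
    if num_frames == 0 or target_len <= 0:
        raise ValueError("need at least one frame and a positive target length")
    dim = len(frames[0])
    pooled = []
    for i in range(target_len):
        start = (i * num_frames) // target_len
        # Ceiling division so that every segment contains at least one frame,
        # even when there are fewer frames than target positions.
        end = -((-(i + 1) * num_frames) // target_len)
        segment = frames[start:end]
        pooled.append(
            [sum(frame[d] for frame in segment) / len(segment) for d in range(dim)]
        )
    return pooled


# Hypothetical example: 50 frames (2 seconds at 25 fps) with 2-dim features,
# pooled to match an utterance with 6 text tokens.
frames = [[float(t), 1.0] for t in range(50)]
pooled = average_pool_to_length(frames, 6)
print(len(pooled), len(pooled[0]))
```

With real features the inner dimension would be 2048 instead of 2, so the pooled output would have shape (6, 2048) regardless of how many frames the original clip had.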

I see, thanks for the information!