number of frames per utterance for MELD visual features

Question


dondongwon opened this issue a year ago · 4 comments

Hi, thanks for your work.

I've been checking the feature shapes, and for dialogue 1, utterance 1 there appear to be only 6 frames. At 25 fps that would be 6/25 of a second, which doesn't make sense.

Is the frame rate of 25 fps accurate?

Answer 1 · 2023-04-15T05:17:54.000Z

It seems what you are checking is the tokenization IDs, so 6 tokens does not mean the duration is 6/25 sec. The frame rate applies only to audio.

Answer 2 · 2023-04-15T05:27:36.000Z

The last line in the image is the shape of the video_features, the ResNet-101 features that were used, which seems to be (6, 2048). ResNet extracts features per image frame, which means that only 6 frames were extracted for the first utterance of the first dialogue.

More clarification would be appreciated. Thanks!

Answer 3 · 2023-04-15T05:51:59.000Z

I see. These are pre-aligned features, obtained by average-pooling the 25 fps audio/video features to the same length as the text IDs. This preprocessing only aims to reduce the computational cost, since OT is very slow and the original audio/video sequences are too long.
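
For illustration, here is a minimal sketch of how per-frame features could be average-pooled down to the text length. It is not the repository's actual preprocessing code: the function name `average_pool_to_length`, the even segment split, and the 75-frame example (3 seconds at 25 fps) are all assumptions made for the example.

```python
import numpy as np


def average_pool_to_length(features: np.ndarray, target_len: int) -> np.ndarray:
    """Average-pool a (num_frames, dim) array down to (target_len, dim).

    Hypothetical helper: the frame axis is split into target_len roughly
    equal segments and each segment is replaced by its mean.
    """
    num_frames = features.shape[0]
    if num_frames == 0 or target_len <= 0:
        raise ValueError("need at least one frame and a positive target length")

    # Segment boundaries along the frame axis (may be fractional).
    edges = np.linspace(0, num_frames, target_len + 1)
    starts = np.floor(edges[:-1]).astype(int)
    # Every segment covers at least one frame, even when num_frames < target_len.
    ends = np.maximum(np.ceil(edges[1:]).astype(int), starts + 1)

    return np.stack([features[s:e].mean(axis=0) for s, e in zip(starts, ends)])


# Assumed example: 75 frames of 2048-dim ResNet features, 6 text tokens.
frame_features = np.random.rand(75, 2048)
pooled = average_pool_to_length(frame_features, target_len=6)
print(pooled.shape)
```

Under these assumptions, a (6, 2048) array means 6 pooled segments, one per text token, not 6 raw video frames.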

Answer 4 · 2023-04-17T16:03:51.000Z

I see, thanks for the information!