number of frames per utterance for MELD visual features
dondongwon opened this issue · 4 comments
It seems that what you are checking is the tokenization ids, so 6 tokens do not mean the duration is 6/25 sec. The frame rate only applies to audio.
The last line in the image is the shape of video_features, the ResNet-101 features that were used, which seems to be (6, 2048). ResNet extracts features per image frame, which means that only 6 frames were extracted for the 1st utterance of the 1st dialogue.
More clarification would be appreciated. Thanks!
I see. These are pre-aligned features, which are obtained by average-pooling the 25 fps audio/video features to the same length as the text ids. This preprocessing just aims to reduce the computational cost, since OT is very slow and the original audio/video sequences are too long.
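For anyone reading later, here is a minimal sketch of what this kind of pooling could look like. It is not the repository's actual preprocessing code: the function name `pool_to_length`, the way frames are split into bins, and the 50-frame example are all assumptions made for illustration. It only shows how a (T, 2048) sequence of per-frame features can end up with shape (6, 2048) when the utterance has 6 text ids.

```python
import numpy as np


def pool_to_length(features: np.ndarray, target_len: int) -> np.ndarray:
    """Average-pool a (T, D) feature sequence to (target_len, D).

    Hypothetical helper: frames are split into `target_len` contiguous
    bins (bins may overlap by one frame when T is not divisible by
    target_len) and each bin is replaced by its mean.
    """
    num_frames, dim = features.shape
    if num_frames == 0 or target_len <= 0:
        raise ValueError("need at least one frame and a positive target length")

    pooled = np.empty((target_len, dim), dtype=np.float64)
    for i in range(target_len):
        start = (i * num_frames) // target_len            # floor(i * T / L)
        end = -(-((i + 1) * num_frames) // target_len)    # ceil((i + 1) * T / L)
        pooled[i] = features[start:end].mean(axis=0)
    return pooled


# Assumed example: a 2-second utterance at 25 fps gives 50 frames of
# 2048-dim ResNet features; the utterance tokenizes to 6 text ids.
frame_features = np.random.rand(50, 2048)
num_text_ids = 6
aligned = pool_to_length(frame_features, num_text_ids)
print(aligned.shape)  # (6, 2048)
```

Under these assumptions, the first dimension of the stored features reflects the number of text ids, not the number of video frames, so it cannot be converted to a duration by dividing by 25.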
I see, thanks for the information!