In inference.ipynb, each video is trimmed to a length of 5 s. How should I handle videos of different lengths, especially ones shorter than 5 s?

Question

In inference.ipynb, each video is trimmed to a length of 5 s. How should I handle videos whose lengths differ, in particular videos that are shorter than 5 s?

sulizhi opened this issue a year ago · 2 comments

Answer 1 · 2023-06-24T03:44:50.000Z

The basic unit for the video models is "16x4": we sample 16 frames, each 4 frames apart. Since we trained on 30 fps video, you need at least 64 frames at 30 fps, or about 2.133 s of video, to run the model. If your video is shorter than that, perhaps you can duplicate frames, or sample with a gap of 2 frames instead, etc. (but I can't guarantee accuracy then).
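To make the "16x4" arithmetic concrete, here is a minimal sketch in plain Python. The function name `sample_clip_indices` and its parameters are hypothetical and not taken from the notebook. Repeating the last frame is only one possible way to "duplicate frames" for a short video, and it is an assumption here, with no accuracy guarantee.

```python
def sample_clip_indices(num_frames, start=0, frames_per_clip=16, stride=4):
    """Return the frame indices for one clip of 16 frames, 4 frames apart.

    Hypothetical helper: if the video ends before the clip does, the last
    available frame is repeated (an assumed padding strategy).
    """
    if num_frames <= 0:
        raise ValueError("video has no frames")
    last = num_frames - 1
    # Clamp every index to the last valid frame so short videos still yield
    # a full clip of frames_per_clip indices.
    return [min(start + i * stride, last) for i in range(frames_per_clip)]


FPS = 30
span = 16 * 4              # frames covered by one clip
min_seconds = span / FPS   # roughly 2.133 s at 30 fps

full = sample_clip_indices(64)    # exactly one clip's worth of frames
short = sample_clip_indices(10)   # short video: trailing indices repeat frame 9
```

Sampling with a gap of 2 frames instead would correspond to `stride=2`, which covers 32 frames (about 1.07 s at 30 fps).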

Then, for videos longer than 2.133 s, just do as I did in the notebook: pass multiple clips into the model and average the results. In the notebook example, I take 5 s of video and extract the first 128 frames (the first 4.266 s). If you have a longer video, just increase the transcoding duration and sample more clips.
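The multi-clip approach can be sketched roughly as follows. The helper names `clip_starts` and `average_predictions` are hypothetical. The sketch assumes back-to-back, non-overlapping 64-frame clips and per-clip scores given as plain lists of floats, and the model call itself is left out.

```python
def clip_starts(num_frames, frames_per_clip=16, stride=4):
    """Return the start frame of each full, non-overlapping clip.

    Hypothetical helper: one clip spans frames_per_clip * stride frames
    (64 with the defaults). A video shorter than one clip still gets a
    single start at frame 0, so it needs padding when it is sampled.
    """
    span = frames_per_clip * stride
    n_clips = max(1, num_frames // span)
    return [k * span for k in range(n_clips)]


def average_predictions(per_clip_scores):
    """Average the per-class scores across clips, element by element."""
    n = len(per_clip_scores)
    return [sum(column) / n for column in zip(*per_clip_scores)]


# 5 s at 30 fps is 150 frames; only the first 128 fit into whole clips.
starts = clip_starts(150)

# Placeholder scores standing in for the model's output on each clip.
video_scores = average_predictions([[1.0, 3.0], [3.0, 5.0]])
```

For a longer video, `clip_starts` simply returns more start offsets, which matches the idea of increasing the transcoding duration and sampling more clips.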

Answer 2 · 2023-07-05T08:41:04.000Z

The problem has been solved! Thank you very much!