pytorch/vision

FFmpeg-based rescaling and frame rate


🚀 Feature

Add support for (basic) FFmpeg filters for faster video pre-processing. In particular, rescaling and changing the frame rate would be useful when feeding in-the-wild videos through a trained model.

Motivation

I am working on a video loader to feed video frames to a model trained on the Kinetics 400 dataset and obtain predictions. The model is trained at a fixed resolution, on videos with a frame rate of 15fps. To support making predictions on videos from various sources, I at least need to resample them to the correct resolution and frame rate.

The current public API only supports decoding of video frames and trimming, but no other pre-processing, so any further pre-processing has to happen in Python/PyTorch. This approach is noticeably slower than an implementation based on ffmpeg-python, a wrapper around the command-line ffmpeg. For some stats, see Additional context.
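For illustration, the ffmpeg-python pipeline I'm comparing against looks roughly like this (a minimal sketch; the file name, target size, and frame rate are placeholders):

```python
import ffmpeg
import numpy as np
import torch

# Resample to 15 fps and rescale to 224x224 inside ffmpeg, then read
# the raw RGB frames back over a pipe as a single uint8 tensor.
out, _ = (
    ffmpeg
    .input("video.mp4")
    .filter("fps", fps=15)
    .filter("scale", 224, 224)
    .output("pipe:", format="rawvideo", pix_fmt="rgb24")
    .run(capture_stdout=True, quiet=True)
)
frames = torch.from_numpy(
    np.frombuffer(out, np.uint8).copy().reshape(-1, 224, 224, 3)
)
```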

Pitch

I would like to start a conversation on how best to bring such functionality to Torchvision. Changing the resolution/fps is, I imagine, a common requirement when making predictions on videos, so I can see it being a useful part of video I/O. Looking at the C++ code, there is already some support for requesting video frames at a certain resolution [1][2], but this functionality is only exposed in torch.ops.video_reader.read_video_from_file, not in the public API. I can’t find anything similar for requesting a certain frame rate.

Is this something that you would want to add to torchvision.io.read_video? What about to torchvision.io.VideoReader? More generally, is there a plan to add support for all FFmpeg filters in the future? What would that interface look like?
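To make the question concrete, here is one possible shape for such an interface (purely hypothetical; none of these keyword arguments exist in the current API):

```python
import torchvision

# Hypothetical sketch only: width/height/fps do NOT exist today.
# The idea is that the decoder would apply ffmpeg's scale and fps
# filters before frames ever reach Python.
reader = torchvision.io.VideoReader("video.mp4", "video",
                                    width=224, height=224, fps=15)
for frame in reader:
    ...  # frame["data"] would already be 224x224 at 15 fps
```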

Additional context

I’ve done some initial comparisons between torchvision.io.VideoReader + changing the frame rate in Python + torch rescaling on batches of 16 frames, versus an ffmpeg-python pipeline with scale and fps filters, on an 854x480@30fps MP4 input video of ~261s. I’ve included the results below.
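In outline, the torchvision side of the comparison looks like this (a simplified sketch rather than the exact benchmark code; the file name, frame count, and output size are placeholders):

```python
import itertools
import torch
import torchvision
from torchvision.transforms.functional import resize

# Decode with VideoReader, then rescale with torchvision on batches
# of 16 frames; the Python-side frame-rate resampling is omitted here.
reader = torchvision.io.VideoReader("video.mp4", "video")
frames = torch.stack([f["data"] for f in itertools.islice(reader, 150)])
resized = [resize(batch, [224, 224]) for batch in frames.split(16)]
```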

Decoding the first seconds of a clip (output fps=15, output size=input size):

[plot: clip-length]

Decoding 1s of video from a given start time (output fps=15, output size=input size):

[plot: start-time]

Changing the frame rate for the first 1s of video (output size=input size):

[plot: framerate-1s]

Changing the frame rate for the first 5s of video (output size=input size):

[plot: framerate-5s]

Rescaling the first 1s of video (output fps=15):

[plot: scale-1s]

Rescaling the first 1s of video with FFmpeg's fast_bilinear algorithm (output fps=15):

[plot: scale-1s-fast]

Rescaling the first 5s of video (output fps=15):

[plot: scale-5s]

cc @bjuncek

Hi,

Thanks for bringing up this issue!

Our current thinking is that most (if not all?) ffmpeg filters can be implemented with basic Python / PyTorch / torchvision operators without too much loss of speed, and as such there would be limited benefit in packaging ffmpeg's filter logic in PyTorch (as we would not have GPU / gradient support out of the box).

Your points about resizing speed are valid though, and I believe this relates to a current limitation of torchvision's resize transform: it converts input tensors to fp32, even if the input is uint8. This means there is a significant cost to performing resize on single frames compared to alternative implementations that work directly on uint8.
This is something we plan to improve in the future, which should bring the speed of resizing frames in torchvision in line with ffmpeg / opencv.
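Concretely, the round trip looks something like this (an illustrative sketch; the shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

# A batch of uint8 frames is upcast to fp32 for bilinear
# interpolation, then converted back - the fp32 round trip is the
# overhead mentioned above.
frames_u8 = torch.randint(0, 256, (16, 3, 480, 854), dtype=torch.uint8)
resized = F.interpolate(frames_u8.float(), size=(224, 224),
                        mode="bilinear", align_corners=False)
resized_u8 = resized.round().clamp(0, 255).to(torch.uint8)
```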

For the change of frame rate, the results you present are interesting; I wasn't expecting such a large difference.
Could you make the script that you used to obtain those results available so that we can have a look?

Thoughts?

Hi @fmassa,

Thanks for your reply! I put the relevant code into a Colab MWE. The numbers differ from the plots above, but they tell the same story.

I agree that the frame-rate results are a bit strange. The behaviour I would expect is more like the Torchvision curve: a constant time to decode the video clip plus negligible overhead to duplicate/drop frames, so the achieved fps scales linearly with the output frame rate. I'm not entirely sure why ffmpeg scales the way it does; I'll take a look and see if I find anything.

Edit: I suppose for the ffmpeg case, dropping/duplicating frames does have some non-negligible overhead, as each frame has to be read over a pipe from the ffmpeg process, even if it is a duplicate. In light of that, the ffmpeg curve makes sense.
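For reference, the Python-side frame-rate change is essentially a nearest-index map along these lines (a simplified sketch, not the exact notebook code):

```python
import torch

def resample_frames(frames, src_fps, dst_fps):
    # Map each output frame to the nearest earlier source frame,
    # duplicating or dropping frames as needed.
    n_src = frames.shape[0]
    n_out = int(round(n_src * dst_fps / src_fps))
    idx = (torch.arange(n_out) * (src_fps / dst_fps)).floor().long()
    return frames[idx.clamp(max=n_src - 1)]
```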

Thanks for the notebooks, @slimm!

I'm still a bit surprised that the 5s resampling example shows FFmpegVideoReader being much faster. Maybe what's going on is that, for that clip length and frame rate, ffmpeg can jump to keyframes and then decode only a few frames directly, for faster reading? If that's the case, this is not something the torchvision video reader supports for now, but it's in the plans.
@bjuncek can you have a look to double-check?
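For context, input-side seeking with ffmpeg-python looks roughly like this; placing ss on the input makes ffmpeg seek at the demuxer level rather than decoding from the start of the file (file name and timestamps are placeholders):

```python
import ffmpeg

# -ss before -i: fast demuxer-level seek to a nearby keyframe,
# then decode 5 seconds at 15 fps from that point.
out, _ = (
    ffmpeg
    .input("video.mp4", ss=60)
    .filter("fps", fps=15)
    .output("pipe:", format="rawvideo", pix_fmt="rgb24", t=5)
    .run(capture_stdout=True, quiet=True)
)
```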

Hi @slimm - thanks a lot for the notebooks, and sorry for the late reply - I've been OOF for the last few days.
I'll take a look at this first thing next week.

My initial thoughts are:

  1. The resizing time difference makes sense: as fmassa mentioned, we convert each frame to fp32, which is suboptimal. Our decoder actually supports ffmpeg-based resizing, so adding that would be fairly straightforward; having said that, I agree with fmassa that the transforms-based approach makes more sense from a usability perspective if it can be done efficiently.
  2. I'm a bit surprised by the 5s clip curves for resampling the video frame rate (graphs 1 and 4) - I'll take a closer look next week. Could it be that in our case we decode every single frame, while the ffmpeg filter has some way of avoiding this? That might explain why our curve is linear in graph 4.
  3. For graph 1, one potential explanation would be better use of multithreading, combined with the point above.

I'll test this out a bit further to see whether ffmpeg is doing something differently from us, and whether resampling would be worth implementing in our low-level API.

Thanks again,
Bruno