abhiTronix/deffcode

[Idea]: Achieving higher FPS from FFmpeg: Both RAW and RENDER.


Faster frame rates from FFmpeg via YUV

YUV420 vs RGB24

Hey! It looks like you've got yourself quite the well-built FFmpeg-for-Python package here. I've been down this rabbit hole myself and thought I'd share some tips for optimizing performance. The speed of FFmpeg's input to Python can be massively accelerated by use of the YUV420 format instead of RGB.

The far-and-away most prevalent video format on the planet today is YUV 4:2:0. MP4s, DVDs, even Blu-rays all come packed in YUV420... This is because it stores the raw binary of each pixel in an average of 12 bits, instead of the 24 bits per pixel consumed by RGB/BGR formats. So, literally half the disk space.
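To put numbers on that for a single 1080p frame (simple arithmetic, not library code):

    w, h = 1920, 1080
    rgb24_bytes = w * h * 3        # 24 bits/pixel -> 6,220,800 bytes per frame
    yuv420_bytes = w * h * 3 // 2  # 12 bits/pixel average -> 3,110,400 bytes per frame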

In my own testing I've found that asking FFmpeg to output YUV420 video as RGB ends up taking significantly longer than collecting the raw YUV data and transforming it to RGB within Python. And that fact is a bit unintuitive. I would certainly expect FFmpeg's multi-processor operations to handle every aspect of video faster than anything a library in Python could. But the slowdown that simply can't be overcome here is in the data pipe itself.

Every frame of RGB video is twice as much data to move through memory space compared to YUV420. And that access-speed inhibitor is so massive that even a single-threaded, blocking operation - grabbing one frame at a time from the FFmpeg pipe, shaping that frame into a compatible NumPy array, and pushing the resulting array through an OpenCV color conversion (COLOR_YUV2RGB_I420) - is FASTER!
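For reference, here's a minimal sketch of that blocking loop, assuming ffmpeg is on the PATH and the video's dimensions are already known ('input.mp4' and the dimensions are placeholders):

    import subprocess
    import numpy as np
    import cv2

    w, h = 1920, 1080                # assumed known video dimensions
    frame_size = w * h * 3 // 2      # one yuv420p frame = 12 bits/pixel

    # Ask FFmpeg to pipe raw yuv420p frames to stdout.
    proc = subprocess.Popen(
        ["ffmpeg", "-i", "input.mp4", "-f", "rawvideo",
         "-pix_fmt", "yuv420p", "pipe:1"],
        stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)

    while True:
        raw = proc.stdout.read(frame_size)     # blocking: grab one frame
        if len(raw) < frame_size:
            break
        # I420 is planar, so OpenCV expects a (h * 3/2, w) single-channel array.
        yuv = np.frombuffer(raw, np.uint8).reshape(h * 3 // 2, w)
        rgb = cv2.cvtColor(yuv, cv2.COLOR_YUV2RGB_I420)  # (h, w, 3) uint8

    proc.stdout.close()
    proc.wait()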

Benchmarks

Render

  • On my AMD Ryzen 2700 (edit: originally listed as 3600) platform (6 cores / 12 threads), I get about 75fps from your 'deFFcode' library when rendering a 1080p mp4 to an OpenCV window, using the sample code provided in the documentation.
  • On the same system, my own little script (https://github.com/roninpawn/ffmpeg_videostream/) renders about the same 74fps to a PyGame window when I set it to pull RGB in from the FFmpeg pipe.
  • BUT! When I allow my script to ingest as YUV, then convert to RGB using OpenCV, it renders 102fps to screen via PyGame.

RAW

  • When ingesting frames RAW and discarding them (no draw or other operation) using deFFcode, I see 95fps on the same 1080p clip.
  • Again, this matches (96fps) what my script gets when forced to ingest as RGB.
  • BUT! When I ingest from FFmpeg as YUV, convert to RGB using OpenCV, and then discard the frames, I get 155fps.
  • And the real CRUSHER is: If I ingest that mp4 as YUV, and I DON'T convert it to RGB, the entire video processes at 213fps.

Threading Possibilities for YUV -> RGB

Additionally, even higher speeds are theoretically possible if multiple threads are established for the YUV -> RGB process, separate from a raw ingest-stacking process.

If one thread is dedicated to queueing 'up to X frames' worth of raw binary snips from the FFmpeg pipe, and another thread is established to process and queue 'up to X frames' of YUV2RGB-converted frames, the ingest function ceases to block the processing function and the processing function ceases to block ingest. The result is the CPU spinning as fast as it can to keep multiple frames ready for whatever method might come along to ask for one.
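A rough sketch of that two-queue arrangement might look like this (names and queue depth are illustrative; the pipe and frame math are as in the earlier sketch):

    import threading
    import queue
    import numpy as np
    import cv2

    raw_q = queue.Queue(maxsize=32)   # 'up to X frames' of raw binary snips
    rgb_q = queue.Queue(maxsize=32)   # 'up to X frames' of converted frames

    def ingest(pipe, frame_size):
        # Thread 1: drain raw YUV frames off the FFmpeg pipe, never waiting on conversion.
        while True:
            raw = pipe.read(frame_size)
            if len(raw) < frame_size:
                raw_q.put(None)       # sentinel marks end-of-stream
                return
            raw_q.put(raw)

    def convert(w, h):
        # Thread 2: turn queued raw bytes into RGB frames, never waiting on the pipe.
        while True:
            raw = raw_q.get()
            if raw is None:
                rgb_q.put(None)
                return
            yuv = np.frombuffer(raw, np.uint8).reshape(h * 3 // 2, w)
            rgb_q.put(cv2.cvtColor(yuv, cv2.COLOR_YUV2RGB_I420))

    # Whatever method comes along then simply calls rgb_q.get() for the next ready frame.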

Wrap-Up

With appropriate configuration I presume a developer could accomplish everything I've described using your library as it exists today. But I think the following are worth considering across continued development:

  • Implementing YUV as the default ingest for FFmpeg instead of RGB
  • Including stock methods for returning RGB from a raw YUV ingest
  • Threading those methods for maximum throughput
  • Communicating the benefits of YUV 4:2:0 over RGB in documentation

All that said, it really depends on how you choose to define the scope of your library whether any of this fits. I just wanted to pass the results of my own tests and development with FFmpeg in Python, along for your consideration. Again, here's the link to the script I developed atop Karl Kroening's 'ffmpeg-python' library if you care to look it over. It's short and sweet: https://github.com/roninpawn/ffmpeg_videostream/

My Current Environment

  • DeFFcode version: 0.2.0
  • Python version: 3.7.2
  • Operating System and version: Windows 10

@roninpawn Thank you for the detailed and well-explained post. It sounds very interesting, and you've made some very good suggestions.

But let's get down to the nitty-gritty of practically implementing these ideas. I'll focus on the Wrap-Up part of the post and discuss each point one by one here:

Implementing YUV as the default ingest for FFmpeg instead of RGB

Yes, but currently RGB (or BGR) is the only accepted pixel format for approximately 98% of the Computer Vision libraries that I know of in Python. And that's huge, so in my opinion RGB is the most adaptable pixel format for DeFFcode, even being on the slower side. What I think is better is to add a well-made note for the end user, explaining that they can use the YUV format to achieve faster raw decoding performance, and that it's their choice to use frame_format="yuv420p" according to their application. What do you think?
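For example, opting in would look something like this with the current FFdecoder API ('foo.mp4' is a placeholder):

    from deffcode import FFdecoder

    # opt in to raw YUV output instead of the default RGB
    decoder = FFdecoder("foo.mp4", frame_format="yuv420p").formulate()

    for yuv_frame in decoder.generateFrame():
        if yuv_frame is None:
            break
        # ...the user converts/consumes the raw YUV frame as their application requires

    decoder.terminate()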

Including stock methods for returning RGB from a raw YUV ingest
BUT! When I ingest from FFmpeg as YUV, convert to RGB using OpenCV, and then discard the frames, I get 155fps.

This is very good, but I need to do some benchmarks. Actually, it almost seems too good to be true, and I need some internal testing before drawing any conclusions. Also, I'm not going to use OpenCV at all in DeFFcode, because it was designed in the first place to replace OpenCV in my vidgear library; rather, I'll be implementing a well-optimized YUV-to-RGB converter here myself in Cython.
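For reference, the per-pixel arithmetic such a converter performs (BT.601 video-range coefficients) looks roughly like this in plain NumPy -- a sketch of the math, not the planned Cython implementation:

    import numpy as np

    def i420_to_rgb(buf, w, h):
        # Split the planar I420 buffer: full-res Y, quarter-res U and V.
        y_size, c_size = w * h, (w * h) // 4
        y = np.frombuffer(buf, np.uint8, y_size).reshape(h, w).astype(np.float32)
        u = np.frombuffer(buf, np.uint8, c_size, y_size).reshape(h // 2, w // 2)
        v = np.frombuffer(buf, np.uint8, c_size, y_size + c_size).reshape(h // 2, w // 2)
        # Upsample chroma 2x in both directions (nearest neighbour) and center it.
        u = u.repeat(2, 0).repeat(2, 1).astype(np.float32) - 128.0
        v = v.repeat(2, 0).repeat(2, 1).astype(np.float32) - 128.0
        y = 1.164 * (y - 16.0)
        rgb = np.empty((h, w, 3), np.float32)
        rgb[..., 0] = y + 1.596 * v                  # R
        rgb[..., 1] = y - 0.392 * u - 0.813 * v      # G
        rgb[..., 2] = y + 2.017 * u                  # B
        return np.clip(rgb, 0, 255).astype(np.uint8)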

Threading those methods for maximum throughput

@roninpawn That's a big no. Threading does not mix well with the subprocess module that DeFFcode currently uses to run its FFmpeg pipeline; in my years of experience with my other vidgear library, it results in IO errors, missing frames, corrupted output, and other anomalies. A buffered queue mechanism is a possibility, but that will slow things down, as seen in vidgear's CamGear API. In my opinion, since the raw YUV-ingest FFmpeg pipeline already runs faster, we can leave this enhancement.

Communicating the benefits of YUV 4:2:0 over RGB in documentation

Yeah, that's doable and I'm thinking the same as you. Rather than enforcing YUV in general, let the user decide what they want.

All that said, it really depends on how you choose to define the scope of your library whether any of this fits. I just wanted to pass the results of my own tests and development with FFmpeg in Python, along for your consideration. Again, here's the link to the script I developed atop Karl Kroening's 'ffmpeg-python' library if you care to look it over. It's short and sweet: https://github.com/roninpawn/ffmpeg_videostream/

Yes, I'm well aware of your work and I think you did a commendable job with the things at hand. I personally wasn't in favor of ffmpeg-python and wanted to implement a solution myself to have full control over the library. I've also considered other wrappers, but some have installation problems and others have no support for hardware decoding: abhiTronix/vidgear#148

Too Good to be True

It does "seem too good to be true," doesn't it! ;) This notion that FFmpeg's insanely optimized multi-processing could have its legs swept by low-level, old-school 'bandwidth' issues is ridiculously unintuitive! But where I presume that FFmpeg achieves its YUV > RGB transforms faster than the OpenCV library can, FFmpeg still ends up needing to push roughly 6.2 million bytes of RGB through a pipe, instead of 3.1 million bytes of YUV -- per frame!

Which I suppose is something like deciding whether to carry an inflatable bed up from the basement, BEFORE or AFTER, you've inflated it. 😄

Benchmarks

While I know this is no substitute for the benchmarks you'll want to craft and run on deFFcode, I have these figures in pocket, so I'll share them anyway. These are the results of the RAW-access and PyGame-rendered benchmarks I wrote to test the various implementations of my 'FFmpeg Videostream' class object.

Given the matched results I got testing deFFcode > RGB against my Videostream > RGB script, I would expect you to find similar results, notwithstanding Intel-vs-AMD type factors.

    1920x1080:
        --- TEST #1: FFmpeg > RGB24 > RAW (no draw)
        Processed 8216 frames at (1920, 1080) resolution from 'test_video.mp4' in 85.251 seconds.
        Effective rate of 96.4 frames per second.

        --- TEST #2: FFmpeg > YUV420p > RAW (no draw)
        Processed 8216 frames at (1920, 1080) resolution from 'test_video.mp4' in 38.444 seconds.
        Effective rate of 213.7 frames per second.

        --- TEST #3: FFmpeg > YUV420p > OpenCV:RGB (no draw)
        Processed 8216 frames at (1920, 1080) resolution from 'test_video.mp4' in 52.765 seconds.
        Effective rate of 155.7 frames per second.

        --- TEST #4: FFmpeg > YUV420p > OpenCV:BGR (no draw)
        Processed 8216 frames at (1920, 1080) resolution from 'test_video.mp4' in 54.155 seconds.
        Effective rate of 151.7 frames per second.

        Begin Pygame DRAW Tests
        --- TEST #5: FFmpeg > RGB24 > PyGame
        Processed 8216 frames at (1920, 1080) resolution from 'test_video.mp4' in 110.814 seconds.
        Effective rate of 74.1 frames per second.

        --- TEST #6: FFmpeg > YUV420p > OpenCV:RGB > PyGame
        Processed 8216 frames at (1920, 1080) resolution from 'test_video.mp4' in 80.303 seconds.
        Effective rate of 102.3 frames per second.

        --- TEST #7: FFmpeg > YUV420p > OpenCV:BGR > PyGame
        Processed 8216 frames at (1920, 1080) resolution from 'test_video.mp4' in 82.463 seconds.
        Effective rate of 99.6 frames per second.

        --- TEST #8: PyGame scales frame in BGR
        Processed 8216 frames at (1920, 1080) resolution from 'test_video.mp4' in 89.591 seconds.
        Effective rate of 91.7 frames per second.

        --- TEST #9: PyGame scales frame in RGB
        Processed 8216 frames at (1920, 1080) resolution from 'test_video.mp4' in 88.831 seconds.
        Effective rate of 92.5 frames per second.

    1280x720:
        --- TEST #1: FFmpeg > RGB24 > RAW (no draw)
        Processed 3819 frames at (1280, 720) resolution from '720_test.mp4' in 15.701 seconds.
        Effective rate of 243.2 frames per second.

        --- TEST #2: FFmpeg > YUV420p > RAW (no draw)
        Processed 3819 frames at (1280, 720) resolution from '720_test.mp4' in 6.753 seconds.
        Effective rate of 565.5 frames per second.

        --- TEST #3: FFmpeg > YUV420p > OpenCV:RGB (no draw)
        Processed 3819 frames at (1280, 720) resolution from '720_test.mp4' in 10.635 seconds.
        Effective rate of 359.1 frames per second.

        --- TEST #4: FFmpeg > YUV420p > OpenCV:BGR (no draw)
        Processed 3819 frames at (1280, 720) resolution from '720_test.mp4' in 10.553 seconds.
        Effective rate of 361.9 frames per second.

        Begin Pygame DRAW Tests
        --- TEST #5: FFmpeg > RGB24 > PyGame
        Processed 3819 frames at (1280, 720) resolution from '720_test.mp4' in 21.44 seconds.
        Effective rate of 178.1 frames per second.

        --- TEST #6: FFmpeg > YUV420p > OpenCV:RGB > PyGame
        Processed 3819 frames at (1280, 720) resolution from '720_test.mp4' in 15.994 seconds.
        Effective rate of 238.8 frames per second.

        --- TEST #7: FFmpeg > YUV420p > OpenCV:BGR > PyGame
        Processed 3819 frames at (1280, 720) resolution from '720_test.mp4' in 16.04 seconds.
        Effective rate of 238.1 frames per second.

        --- TEST #8: PyGame scales frame in BGR
        Processed 3819 frames at (1280, 720) resolution from '720_test.mp4' in 17.386 seconds.
        Effective rate of 219.7 frames per second.

        --- TEST #9: PyGame scales frame in RGB
        Processed 3819 frames at (1280, 720) resolution from '720_test.mp4' in 17.493 seconds.
        Effective rate of 218.3 frames per second.

I can add to this benchmark that in an experimental build of my script I added a "Threaded Frames Extractor" method based on Benjamin Lowe's work here: https://github.com/bml1g12/benchmarking_video_reading_python, and it liberated an extra 10-15fps over TEST #3 from the list above. The method simply stacks a threaded queue of unprocessed frames from the bytestream. I have not tested the virtue of threading the YUV > RGB process, either alone or in conjunction with this.

YUV > RGB Transform Method

I'm thrilled to hear that you mean to implement your own method(s) for color-space conversion! My first thought, once I realized the throughput benefit of ingesting as YUV, was "there's got to be a better transform than OpenCV out there!" In particular, I was hoping for a hardware implementation.

I don't know the hardware end of things, nor the difficulties of accessing dedicated GPU functions from within Python. All I know is that MP4 is decoded at a relative BLAZE even on weakly-powered tablets and phones, where manufacturers seek to advertise the many 'hours of Netflix' you can watch on a single battery charge. With that, my assumption is: modern processors include dedicated hardware for the screen-rendering and decoding of YUV420p and its common variants. But, like I say: These are darts thrown in the dark.

I'm also currently under the impression that Simple DirectMedia Layer (SDL) https://www.libsdl.org/ implements methods that render a YUV pipe directly to screen. I intend to look into this at some point.

What do I think?

It's a difficult call. My benchmarks suggest a 25-33% performance increase with YUV ingest + local RGB conversion. And that's a LOT of performance to leave on the table. Especially knowing that most developers will never so much as dream there could be such massive benefits by this approach.

On the other hand, it is highly specific. And there are additional concerns where increased CPU usage comes into play. It's been my experience that FFmpeg, left to handle the YUV > RGB conversion on its own, consumes about 50% of my CPU while active. When OpenCV is implemented for the conversion, I see a bump of around 10% additional core usage. Which could be a factor to a given developer.

Also, 12-bit YUV is technically a lossy format, I believe. Some color information is lost in reducing the storage footprint. The human eye doesn't notice the difference, and subsampling helps maintain the accuracy of input to output. But there's no guarantee of perfect fidelity.

That's not a problem if the source is already in this -- the most common format on the planet. But if the source is not already 12-bit YUV, implicitly forcing an RGB or BGR source down to it would not be suitable for fine scientific purposes, where the color-averaging that occurs across pixel quadrants would skew results. (Don't just take my word for everything in these last two paragraphs -- this is what I THINK I understand of it all.)

With those considerations: I suppose I agree. deFFcode should not default FFmpeg's output to YUV420p.
(Acknowledging that even 24-bit RGB is implicitly lossy upon a 32-bit RGB source. And you kind of just have to pick a default.)

But I also feel that merely documenting the benefits of the proposed FFmpeg > YUV420p > RGB method isn't enough. Many will never discover that for the cost of two extra lines of code they might increase the speed of their application by a full third; that for every hour-long job they queue, they might have been loading up the next one in just 40 minutes.
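(For concreteness, those "two extra lines" are essentially the reshape-and-convert pair from my earlier sketch, given one raw I420 frame buffer and known dimensions:)

    yuv = np.frombuffer(raw, np.uint8).reshape(h * 3 // 2, w)
    rgb = cv2.cvtColor(yuv, cv2.COLOR_YUV2RGB_I420)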

With that in mind, I would propose for consideration: a public method that implements this local YUV > RGB concept. One that can be optimized and maintained by the library's developers; that implements the library's own conversion matrix; and that is perhaps even benefited by the kind of queued frame preparation Benjamin Lowe tested in his benchmarks.

Because ultimately, I see it as a public-awareness challenge that needs to be overcome. And by maintaining a stock method (especially if appropriately named), unfamiliar developers draw nearer to discovering what - in likely the MAJORITY of all use cases - is measurably the fastest, most powerful, and best solution going. Noting again that the majority of the world's video media is encoded and circulated in this format, and that the stock FFmpeg > RGB method leaves 25-33% of attainable efficiency untapped.

Those are my thoughts.