intel/libva

VA-API hardware decoding is slower than software decoding on Intel Celeron N4000

Talkless opened this issue · 16 comments

We have some very small Chinese mini-PC that has Intel N4000.

I've installed Debian 12 in it, with VA-API:

$ vainfo 
libva info: VA-API version 1.17.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_17
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.17 (libva 2.12.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 23.1.1 ()
vainfo: Supported profile and entrypoints
      VAProfileNone                   : VAEntrypointVideoProc
      VAProfileNone                   : VAEntrypointStats
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointFEI
      VAProfileH264Main               : VAEntrypointEncSliceLP
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSlice
      VAProfileH264High               : VAEntrypointFEI
      VAProfileH264High               : VAEntrypointEncSliceLP
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointEncPicture
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
      VAProfileH264ConstrainedBaseline: VAEntrypointFEI
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
      VAProfileVP8Version0_3          : VAEntrypointVLD
      VAProfileVP8Version0_3          : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointFEI
      VAProfileHEVCMain10             : VAEntrypointVLD
      VAProfileHEVCMain10             : VAEntrypointEncSlice
      VAProfileVP9Profile0            : VAEntrypointVLD
      VAProfileVP9Profile2            : VAEntrypointVLD

I'm using GStreamer 1.22.1 with the vah264dec element, but on this machine (it works fine on other Celerons) I get only about 16 fps at 720p, while with the avdec_h264 software decoder element (ffmpeg) I get the full 25 fps.

intel_gpu_top does show that "Video" usage is non-zero with vah264dec, and zero with software decoding, so I assume it does in principle work..?

GStreamer logs while playing video under vah264dec:

0:00:13.954733102 16454 0x561480aa7c00 WARN            videodecoder gstvideodecoder.c:3668:gst_video_decoder_clip_and_push_buf:<vah264dec0> Dropping frame due to QoS. start:0:00:12.719919487 deadline:0:00:12.719919487 earliest_time:0:00:13.347737097
0:00:13.955002097 16454 0x561480aa7c00 WARN            videodecoder gstvideodecoder.c:3668:gst_video_decoder_clip_and_push_buf:<vah264dec0> Dropping frame due to QoS. start:0:00:12.759917944 deadline:0:00:12.759917944 earliest_time:0:00:13.347737097
0:00:13.961621624 16454 0x561480aa7c00 WARN            videodecoder gstvideodecoder.c:3668:gst_video_decoder_clip_and_push_buf:<vah264dec0> Dropping frame due to QoS. start:0:00:12.799916413 deadline:0:00:12.799916413 earliest_time:0:00:13.347737097
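
To put a number on the drops, the warnings can be counted from a captured debug log. This is only a minimal sketch: it assumes the log was saved by redirecting stderr with GST_DEBUG=videodecoder:2 set, and the function name count_qos_drops and the file name gst.log are made up for illustration.

```shell
#!/bin/sh
# Count "Dropping frame due to QoS" warnings in a GStreamer debug log read from stdin.
# grep -c exits non-zero when nothing matches, so "|| true" keeps the exit status clean.
count_qos_drops() {
    grep -c 'Dropping frame due to QoS' || true
}

# Hypothetical usage, capturing the debug log (written to stderr) to gst.log first:
#   GST_DEBUG=videodecoder:2 gst-launch-1.0 ... 2> gst.log
#   count_qos_drops < gst.log
```

Comparing the count over the same playback time for vah264dec and avdec_h264 gives a rough, repeatable measure instead of an eyeballed frame rate.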

I'm not really sure if I should report this issue here or to GStreamer though, so sorry if misjudged, though it seemed as if something's wrong with VA driver.

how about media engine usage from intel_gpu_top?
and what's the whole gst command line?

This is what I see in intel_gpu_top:
[image: intel_gpu_top screenshot]

Where Viewer is our Qt application with GStreamer playback.

GST pipeline:

rtspsrc location=rtsp://... protocols=tcp latency=100 buffer-mode=slave ! queue max-size-buffers=0 ! rtph264depay ! h264parse ! vah264dec compliance=3 ! glupload ! glcolorconvert ! qmlglsink

Same issue with Dropping frame due to QoS if I use it via gst-launch and glimagesink in terminal.

Looks like there is a similar performance issue on another computer with a Celeron J4125.

It renders 720p at about 18-20fps (while the original stream is 25fps), and 1080p is rendered at only ~9fps, meanwhile the software decoder can handle 1080p at the full 25fps.

It has Debian 11 though, I can try installing 12.

N4500 works fine if I boot Debian 11 by forcing GPU detection with i915.force_probe=4e55.
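
To confirm that the parameter actually reached the running kernel, the kernel command line can be inspected. A small sketch, where force_probe_id is a hypothetical helper name and the command line is passed in as a string:

```shell
#!/bin/sh
# Print the value of i915.force_probe found in a kernel command line, if any.
# $1: the full command line as one string (for example the contents of /proc/cmdline).
force_probe_id() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^i915\.force_probe=//p'
}

# Usage on a running system:
#   force_probe_id "$(cat /proc/cmdline)"
```

An empty result means the parameter was not passed at boot.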

J3060 and I believe J1900 worked fine too.

I've upgraded J4125 machine to Debian Sid, and now it handles TWO video streams at 1080p at 25fps.

I'll try to upgrade N4000 to Sid too.

Just upgraded the N4000 to Sid too.

vainfo:

$ vainfo
libva info: VA-API version 1.19.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_18
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.19 (libva 2.12.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 23.2.3 ()
vainfo: Supported profile and entrypoints
      VAProfileNone                   : VAEntrypointVideoProc
      VAProfileNone                   : VAEntrypointStats
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointFEI
      VAProfileH264Main               : VAEntrypointEncSliceLP
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSlice
      VAProfileH264High               : VAEntrypointFEI
      VAProfileH264High               : VAEntrypointEncSliceLP
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointEncPicture
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
      VAProfileH264ConstrainedBaseline: VAEntrypointFEI
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
      VAProfileVP8Version0_3          : VAEntrypointVLD
      VAProfileVP8Version0_3          : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointFEI
      VAProfileHEVCMain10             : VAEntrypointVLD
      VAProfileHEVCMain10             : VAEntrypointEncSlice
      VAProfileVP9Profile0            : VAEntrypointVLD
      VAProfileVP9Profile2            : VAEntrypointVLD

Sadly, the upgrade didn't help. The N4000 manages only about 16-17fps at 720p, and 9fps at 1080p.

From intel_gpu_top, the video engine utilization is 2.43%, so it is almost idle. This is not a decode issue; it may be caused by something else.
AFAIK, it can decode multiple sessions.
I guess it is related to glcolorconvert. @xhaihao, could you help check the command line? I suspect it is not a suitable one.

@Talkless There should be a data copy between vah264dec and glupload. Could you check which caps are used? You can specify video/x-raw(memory:DMABuf) if you want to avoid the data copy.
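
One way to check the negotiated caps is the -v flag of gst-launch-1.0, which prints each pad's caps as they are set. A rough sketch, assuming the verbose output was saved to a file; uses_dmabuf and caps.log are illustrative names, not part of GStreamer:

```shell
#!/bin/sh
# Report whether any "caps = ..." line from `gst-launch-1.0 -v` output (read on stdin)
# mentions the DMABuf memory feature.
uses_dmabuf() {
    if grep -q 'caps = .*memory:DMABuf'; then
        echo "dmabuf"
    else
        echo "system-memory-or-other"
    fi
}

# Hypothetical usage:
#   gst-launch-1.0 -v ... ! vah264dec ! glimagesink > caps.log 2>&1
#   grep 'vah264dec' caps.log | uses_dmabuf
```

Plain video/x-raw on the decoder's source pad, with no memory feature in parentheses, suggests frames are being copied through system memory.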

If it's a data copy issue, why does it disappear on the J4125 when I upgrade to Debian Sid while using the same self-built GStreamer 1.22.1 binaries (I don't use the distribution's GStreamer packages)?

My hypothesis is that newer va-api drivers fixed it (I'm using non-free variants in Debian, such as i965-va-driver-shaders and intel-media-va-driver-non-free).

I'll try to fiddle with caps and will try to render pipeline visualization to see what it's doing though, thanks for the hints.
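
For the pipeline visualization, GStreamer can dump .dot graphs when the GST_DEBUG_DUMP_DOT_DIR environment variable is set; gst-launch-1.0 writes them on state changes, while an application has to trigger the dump itself. A sketch, assuming Graphviz's dot is installed; the directory and the helper name render_dot_graphs are arbitrary:

```shell
#!/bin/sh
# Render every .dot file in the given directory to a PNG next to it, using Graphviz.
render_dot_graphs() {
    dir=$1
    for f in "$dir"/*.dot; do
        [ -e "$f" ] || continue              # glob did not match: no dumps found
        dot -Tpng "$f" -o "${f%.dot}.png"
    done
}

# Hypothetical usage:
#   mkdir -p /tmp/gst-dot
#   GST_DEBUG_DUMP_DOT_DIR=/tmp/gst-dot gst-launch-1.0 ... ! vah264dec ! glimagesink
#   render_dot_graphs /tmp/gst-dot
```

The rendered graph shows the caps negotiated on each link, which makes it easy to see what sits between the decoder and the GL elements.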

EDIT: I take back what I said about the J4125 working on Sid. I just upgraded from 12 to Sid again and the performance is not fixed. Not sure why I was convinced it was working OK. Sorry, I have to do more research.

Now that's a discovery for me:

[image]

Even though vah264dec and glupload both support DMABuf, it is not used by default; video/x-raw is used. So I guess that on a fast enough system I did not notice the copying penalty, and you're right. I just need to specify the caps correctly, because so far I have failed to make it work...

If I explicitly use the "slow" version, like ... vah264dec ! video/x-raw ! glimagesink, it works as before. If I specify video/x-raw(memory:DMABuf) instead, it fails with a rather irrelevant error message ("failed delayed linking some pad of GstQTDemux named qtdemux0 to some pad of GstH264Parse named h264parse0") with this test pipeline:

$ ./gst-launch-1.0  curlhttpsrc location="https://ia800201.us.archive.org/12/items/BigBuckBunny_328/BigBuckBunny_512kb.mp4" ! qtdemux! h264parse ! queue ! vah264dec ! "video/x-raw(memory:DMABuf)" ! glimagesink 
Setting pipeline to PAUSED ...
Pipeline is PREROLLING ...
Got context from element 'sink': gst.gl.GLDisplay=context, gst.gl.GLDisplay=(GstGLDisplay)"\(GstGLDisplayX11\)\ gldisplayx11-0";
Got context from element 'vah264dec0': gst.va.display.handle=context, gst-display=(GstObject)"\(GstVaDisplayDrm\)\ vadisplaydrm1", description=(string)"Intel\ iHD\ driver\ for\ Intel\(R\)\ Gen\ Graphics\ -\ 22.2.1\ \(\)", path=(string)/dev/dri/renderD128;
ERROR: from element /GstPipeline:pipeline0/GstCurlHttpSrc:curlhttpsrc0: Internal data stream error.
Additional debug info:
../src/libs/gst/base/gstbasesrc.c(3132): gst_base_src_loop (): /GstPipeline:pipeline0/GstCurlHttpSrc:curlhttpsrc0:
streaming stopped, reason not-linked (-1)
ERROR: pipeline doesn't want to preroll.
WARNING: from element /GstPipeline:pipeline0/GstQTDemux:qtdemux0: Delayed linking failed.
Additional debug info:
gst/parse/grammar.y(853): gst_parse_no_more_pads (): /GstPipeline:pipeline0/GstQTDemux:qtdemux0:
failed delayed linking some pad of GstQTDemux named qtdemux0 to some pad of GstH264Parse named h264parse0
Setting pipeline to NULL ...
Freeing pipeline ...

Maybe I need to enable DMABuf support somehow on my systems..?

So the issue was that GStreamer does not support DMABuf with GLX. It works with EGL if I specify GST_GL_PLATFORM=egl env. variable!

For example, using:

env GST_GL_PLATFORM=egl ./gst-launch-1.0 filesrc location=bbb_sunflower_1080p_30fps_normal.mp4 ! qtdemux! h264parse ! queue ! vah264dec ! glimagesink

I get flawless playback on the N4000 with just ~6% CPU, whereas if I go back to GLX I get ~100% CPU with lots of dropped frames. video/x-raw(memory:DMABuf) is ignored with GLX.
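
The same variable should apply to any process that uses GStreamer's GL elements, not only gst-launch-1.0. A tiny wrapper sketch; run_with_egl is a hypothetical helper, and the binary name ./Viewer is assumed from the application name mentioned earlier:

```shell
#!/bin/sh
# Run any external command with GStreamer's GL platform forced to EGL,
# without changing the environment of the calling shell.
run_with_egl() {
    GST_GL_PLATFORM=egl "$@"
}

# Hypothetical usage with the Qt application (binary name assumed):
#   run_with_egl ./Viewer
```

Keeping the setting in a wrapper rather than exporting it globally makes it easy to switch back to GLX for comparison.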

Closing as invalid.

Cool, how about the intel_gpu_top result? I suppose you can now play back several bitstreams simultaneously.

@XinfengZhang yeah it could play 4 streams. Well not flawlessly (with some rare stuttering), maybe 3 x 1080p would be more realistic/practical. Pretty good result for such a tiny "hdmi stick" pc like this:

[image]

intel_gpu_top with four players:

[image: intel_gpu_top screenshot]

Yes, the video engine still has room to decode more streams, but the render engine is fully utilized.

@XinfengZhang It might work better without the whole GNOME desktop, under Wayland, etc. But still, great stuff. Thanks!