VA-API hardware decoding is slower than software decoding on Intel Celeron N4000
Talkless opened this issue · 16 comments
We have a very small Chinese mini-PC with an Intel Celeron N4000.
I've installed Debian 12 on it, with VA-API:
$ vainfo
libva info: VA-API version 1.17.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_17
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.17 (libva 2.12.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 23.1.1 ()
vainfo: Supported profile and entrypoints
VAProfileNone : VAEntrypointVideoProc
VAProfileNone : VAEntrypointStats
VAProfileMPEG2Simple : VAEntrypointVLD
VAProfileMPEG2Main : VAEntrypointVLD
VAProfileH264Main : VAEntrypointVLD
VAProfileH264Main : VAEntrypointEncSlice
VAProfileH264Main : VAEntrypointFEI
VAProfileH264Main : VAEntrypointEncSliceLP
VAProfileH264High : VAEntrypointVLD
VAProfileH264High : VAEntrypointEncSlice
VAProfileH264High : VAEntrypointFEI
VAProfileH264High : VAEntrypointEncSliceLP
VAProfileVC1Simple : VAEntrypointVLD
VAProfileVC1Main : VAEntrypointVLD
VAProfileVC1Advanced : VAEntrypointVLD
VAProfileJPEGBaseline : VAEntrypointVLD
VAProfileJPEGBaseline : VAEntrypointEncPicture
VAProfileH264ConstrainedBaseline: VAEntrypointVLD
VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
VAProfileH264ConstrainedBaseline: VAEntrypointFEI
VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
VAProfileVP8Version0_3 : VAEntrypointVLD
VAProfileVP8Version0_3 : VAEntrypointEncSlice
VAProfileHEVCMain : VAEntrypointVLD
VAProfileHEVCMain : VAEntrypointEncSlice
VAProfileHEVCMain : VAEntrypointFEI
VAProfileHEVCMain10 : VAEntrypointVLD
VAProfileHEVCMain10 : VAEntrypointEncSlice
VAProfileVP9Profile0 : VAEntrypointVLD
VAProfileVP9Profile2 : VAEntrypointVLD
I'm using GStreamer 1.22.1 with the vah264dec element, but on this machine (it works fine on other Celerons) I get only about 16 fps for 720p, while with the avdec_h264 software decoder element (ffmpeg) I get the full 25 fps.
intel_gpu_top does show that "Video" usage is non-zero with vah264dec, and zero with software decoding, so I assume it does in principle work..?
GStreamer logs while playing video under vah264dec:
0:00:13.954733102 16454 0x561480aa7c00 WARN videodecoder gstvideodecoder.c:3668:gst_video_decoder_clip_and_push_buf:<vah264dec0> Dropping frame due to QoS. start:0:00:12.719919487 deadline:0:00:12.719919487 earliest_time:0:00:13.347737097
0:00:13.955002097 16454 0x561480aa7c00 WARN videodecoder gstvideodecoder.c:3668:gst_video_decoder_clip_and_push_buf:<vah264dec0> Dropping frame due to QoS. start:0:00:12.759917944 deadline:0:00:12.759917944 earliest_time:0:00:13.347737097
0:00:13.961621624 16454 0x561480aa7c00 WARN videodecoder gstvideodecoder.c:3668:gst_video_decoder_clip_and_push_buf:<vah264dec0> Dropping frame due to QoS. start:0:00:12.799916413 deadline:0:00:12.799916413 earliest_time:0:00:13.347737097
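The three warnings above carry enough timing data to put numbers on the problem. Below is a minimal Python sketch that extracts, for each dropped frame, its deadline and how far that deadline lies behind the reported earliest_time. The helper names `to_seconds` and `parse_qos` are made up for illustration, and the regular expression assumes the exact log layout shown in this thread (the log lines are trimmed to the part after the element name).

```python
import re

# Timing fields of a "Dropping frame due to QoS" warning, assuming the
# layout shown in the log excerpt from this thread.
QOS_RE = re.compile(
    r"Dropping frame due to QoS\. start:(\S+) deadline:(\S+) earliest_time:(\S+)"
)

def to_seconds(stamp: str) -> float:
    """Convert an H:MM:SS.nnnnnnnnn timestamp to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def parse_qos(line: str):
    """Return (deadline, earliest_time - deadline) in seconds, or None."""
    match = QOS_RE.search(line)
    if match is None:
        return None
    deadline = to_seconds(match.group(2))
    earliest = to_seconds(match.group(3))
    return deadline, earliest - deadline

# Log lines from the thread, with the timestamp/thread prefix trimmed.
LOG = """\
<vah264dec0> Dropping frame due to QoS. start:0:00:12.719919487 deadline:0:00:12.719919487 earliest_time:0:00:13.347737097
<vah264dec0> Dropping frame due to QoS. start:0:00:12.759917944 deadline:0:00:12.759917944 earliest_time:0:00:13.347737097
<vah264dec0> Dropping frame due to QoS. start:0:00:12.799916413 deadline:0:00:12.799916413 earliest_time:0:00:13.347737097
"""

drops = [d for d in map(parse_qos, LOG.splitlines()) if d is not None]
spacings = [b[0] - a[0] for a, b in zip(drops, drops[1:])]

for deadline, behind in drops:
    print(f"deadline {deadline:.3f}s is {behind * 1000:.0f} ms behind earliest_time")
print("spacing between dropped frames (ms):", [round(s * 1000, 1) for s in spacings])
```

On these three lines the deadlines are about 40 ms apart, which matches a 25 fps stream, and each deadline is roughly 0.55 to 0.63 s behind earliest_time, so the decoder output is running well over half a second late rather than missing by a frame or two.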
I'm not really sure if I should report this issue here or to GStreamer though, so sorry if misjudged, though it seemed as if something's wrong with VA driver.
how about media engine usage from intel_gpu_top?
and what's the whole gst command line?
This is what I see in intel_gpu_top:
Where Viewer is our Qt application with GStreamer playback.
GST pipeline:
rtspsrc location=rtsp://... protocols=tcp latency=100 buffer-mode=slave ! queue max-size-buffers=0 ! rtph264depay ! h264parse ! vah264dec compliance=3 ! glupload ! glcolorconvert ! qmlglsink
Same issue with "Dropping frame due to QoS" if I use it via gst-launch and glimagesink in a terminal.
It looks like a similar performance issue affects another computer with a Celeron J4125.
It renders 720p at about 18-20 fps (while the original stream is 25 fps), and 1080p is rendered at only ~9 fps, whereas the software decoder can handle 1080p at the full 25 fps.
It has Debian 11 though, I can try installing 12.
N4500 works fine if I boot Debian 11 and force GPU detection with i915.force_probe=4e55.
J3060 and, I believe, J1900 worked fine too.
I've upgraded the J4125 machine to Debian Sid, and now it handles TWO video streams at 1080p at 25 fps.
I'll try to upgrade N4000 to Sid too.
Just upgraded N4000 to Sid too.
vainfo:
$ vainfo
libva info: VA-API version 1.19.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_18
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.19 (libva 2.12.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 23.2.3 ()
vainfo: Supported profile and entrypoints
VAProfileNone : VAEntrypointVideoProc
VAProfileNone : VAEntrypointStats
VAProfileMPEG2Simple : VAEntrypointVLD
VAProfileMPEG2Main : VAEntrypointVLD
VAProfileH264Main : VAEntrypointVLD
VAProfileH264Main : VAEntrypointEncSlice
VAProfileH264Main : VAEntrypointFEI
VAProfileH264Main : VAEntrypointEncSliceLP
VAProfileH264High : VAEntrypointVLD
VAProfileH264High : VAEntrypointEncSlice
VAProfileH264High : VAEntrypointFEI
VAProfileH264High : VAEntrypointEncSliceLP
VAProfileVC1Simple : VAEntrypointVLD
VAProfileVC1Main : VAEntrypointVLD
VAProfileVC1Advanced : VAEntrypointVLD
VAProfileJPEGBaseline : VAEntrypointVLD
VAProfileJPEGBaseline : VAEntrypointEncPicture
VAProfileH264ConstrainedBaseline: VAEntrypointVLD
VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
VAProfileH264ConstrainedBaseline: VAEntrypointFEI
VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
VAProfileVP8Version0_3 : VAEntrypointVLD
VAProfileVP8Version0_3 : VAEntrypointEncSlice
VAProfileHEVCMain : VAEntrypointVLD
VAProfileHEVCMain : VAEntrypointEncSlice
VAProfileHEVCMain : VAEntrypointFEI
VAProfileHEVCMain10 : VAEntrypointVLD
VAProfileHEVCMain10 : VAEntrypointEncSlice
VAProfileVP9Profile0 : VAEntrypointVLD
VAProfileVP9Profile2 : VAEntrypointVLD
Sadly, upgrade didn't help. N4000 manages only about 16-17fps @ 720p, and 9fps on 1080p.
From intel_gpu_top, the video utilization is 2.43%, so the video engine is almost idle. This is not a decode issue; it may be caused by something else.
AFAIK, it can decode multiple sessions.
I guess it is related to glcolorconvert. @xhaihao could you help check the command line? I suspect it is not a suitable one.
@Talkless There should be a data copy between vah264dec and glupload. Could you check the caps in use? You can specify video/x-raw(memory:DMABuf) if you want to avoid the data copy.
If it's a data copy issue, why does it disappear for J4125 if I upgrade to Debian Sid while using the same self-built GStreamer 1.22.1 binaries (I don't use the distribution's GStreamer packages)?
My hypothesis is that newer VA-API drivers fixed it (I'm using the non-free variants in Debian, such as i965-va-driver-shaders and intel-media-va-driver-non-free).
I'll try to fiddle with caps and will try to render a pipeline visualization to see what it's doing though, thanks for the hints.
EDIT: I take back my words about J4125 working on Sid. I just upgraded from 12 to Sid again and I don't see the performance fixed. Not sure why I was sure about it working OK. Sorry, gotta do more research.
Now that's a discovery for me:
Even though vah264dec and glupload both support DMABuf, it is not used by default; video/x-raw is used. So I guess that if the system is fast enough, I don't notice the copying penalty, so I guess you're right. I just need to specify the caps correctly, because so far I have failed to make it work...
If I explicitly use the "slow" version like this: ... vah264dec ! video/x-raw ! glimagesink, it works as before, but if I specify video/x-raw(memory:DMABuf) instead, it fails with a rather irrelevant error message, "failed delayed linking some pad of GstQTDemux named qtdemux0 to some pad of GstH264Parse named h264parse0", using this test pipeline:
$ ./gst-launch-1.0 curlhttpsrc location="https://ia800201.us.archive.org/12/items/BigBuckBunny_328/BigBuckBunny_512kb.mp4" ! qtdemux! h264parse ! queue ! vah264dec ! "video/x-raw(memory:DMABuf)" ! glimagesink
Setting pipeline to PAUSED ...
Pipeline is PREROLLING ...
Got context from element 'sink': gst.gl.GLDisplay=context, gst.gl.GLDisplay=(GstGLDisplay)"\(GstGLDisplayX11\)\ gldisplayx11-0";
Got context from element 'vah264dec0': gst.va.display.handle=context, gst-display=(GstObject)"\(GstVaDisplayDrm\)\ vadisplaydrm1", description=(string)"Intel\ iHD\ driver\ for\ Intel\(R\)\ Gen\ Graphics\ -\ 22.2.1\ \(\)", path=(string)/dev/dri/renderD128;
ERROR: from element /GstPipeline:pipeline0/GstCurlHttpSrc:curlhttpsrc0: Internal data stream error.
Additional debug info:
../src/libs/gst/base/gstbasesrc.c(3132): gst_base_src_loop (): /GstPipeline:pipeline0/GstCurlHttpSrc:curlhttpsrc0:
streaming stopped, reason not-linked (-1)
ERROR: pipeline doesn't want to preroll.
WARNING: from element /GstPipeline:pipeline0/GstQTDemux:qtdemux0: Delayed linking failed.
Additional debug info:
gst/parse/grammar.y(853): gst_parse_no_more_pads (): /GstPipeline:pipeline0/GstQTDemux:qtdemux0:
failed delayed linking some pad of GstQTDemux named qtdemux0 to some pad of GstH264Parse named h264parse0
Setting pipeline to NULL ...
Freeing pipeline ...
Maybe I need to enable DMABuf support somehow on my systems..?
So the issue was that GStreamer does not support DMABuf with GLX. It works with EGL if I specify the GST_GL_PLATFORM=egl environment variable!
For example, using:
env GST_GL_PLATFORM=egl ./gst-launch-1.0 filesrc location=bbb_sunflower_1080p_30fps_normal.mp4 ! qtdemux! h264parse ! queue ! vah264dec ! glimagesink
I get flawless playback on N4000 with just ~6% CPU, whereas if I go back to GLX I get ~100% CPU with lots of dropped frames. video/x-raw(memory:DMABuf) is ignored with GLX.
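For an application that spawns gst-launch (or any GStreamer program) from a script, the same fix can be applied by setting the variable in the child's environment. This is a minimal sketch only: `egl_launch` and `run_with_egl` are hypothetical helper names, and the default binary name assumes gst-launch-1.0 is on PATH (the thread uses a self-built ./gst-launch-1.0).

```python
import os
import shlex
import subprocess

def egl_launch(pipeline: str, gst_launch: str = "gst-launch-1.0"):
    """Build (argv, env) for running a pipeline with GST_GL_PLATFORM=egl."""
    env = dict(os.environ)          # copy, so the parent environment is untouched
    env["GST_GL_PLATFORM"] = "egl"  # the setting reported to fix playback here
    argv = [gst_launch] + shlex.split(pipeline)
    return argv, env

def run_with_egl(pipeline: str, gst_launch: str = "gst-launch-1.0") -> int:
    """Run the pipeline and return the process exit code."""
    argv, env = egl_launch(pipeline, gst_launch)
    return subprocess.run(argv, env=env).returncode

# Pipeline taken from the command above.
PIPELINE = (
    "filesrc location=bbb_sunflower_1080p_30fps_normal.mp4 ! qtdemux ! "
    "h264parse ! queue ! vah264dec ! glimagesink"
)
argv, env = egl_launch(PIPELINE)
print(" ".join(argv))
```

Calling `run_with_egl(PIPELINE, "./gst-launch-1.0")` would start playback. For an in-process application such as a Qt viewer, the variable would presumably need to be set before GStreamer's GL code initializes.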
Closing as invalid.
Cool, how about the intel_gpu_top result? I suppose you could play back several bitstreams simultaneously.
@XinfengZhang yeah it could play 4 streams. Well not flawlessly (with some rare stuttering), maybe 3 x 1080p would be more realistic/practical. Pretty good result for such a tiny "hdmi stick" pc like this:
intel_gpu_top with four players:
Yes, the video engine still has room to decode more streams, but render engine utilization is full.
@XinfengZhang Might work better without whole Gnome desktop, etc. In Wayland, etc. But still, great stuff. Thanks!