Performance regressions against dav1d master on aarch64
The rav1d `target/aarch64-unknown-linux-gnu/release/dav1d` binary takes 5.9% more user time and 22.7% more memory (maximum RSS) to decode 8-bit video than `dav1d-1.4.0-83-g872e470`, and 5.3% more user time and 6.7% more memory to decode 10-bit video.
Metric | dav1d 1.4.0-83-g872e470 | rav1d 966d63c | % delta |
---|---|---|---|
8-bit User time (s) | 606.34 | 641.91 | 5.87% |
10-bit User time (s) | 1002.20 | 1055.09 | 5.28% |
8-bit RSS (kbytes) | 201076 | 246724 | 22.70% |
10-bit RSS (kbytes) | 306708 | 327140 | 6.66% |
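For anyone who wants to re-derive the "% delta" column, here is a minimal sketch. It assumes the delta is the relative increase of the rav1d value over the dav1d value; the names `measurements` and `percent_delta` are made up for illustration.

```python
# Values copied from the table above as (dav1d, rav1d) pairs.
measurements = {
    "8-bit User time (s)": (606.34, 641.91),
    "10-bit User time (s)": (1002.20, 1055.09),
    "8-bit RSS (kbytes)": (201076, 246724),
    "10-bit RSS (kbytes)": (306708, 327140),
}


def percent_delta(baseline, candidate):
    """Relative change of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Hypothetical helper dict: metric name -> percent delta.
deltas = {name: percent_delta(d, r) for name, (d, r) in measurements.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:.2f}%")
```

Under that assumption, the computed values round to the same figures as the table.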
Full command lines and output data below
negge@arm1:~/git/dav1d/build# /usr/bin/time -v tools/dav1d -i ~/Videos/Chimera/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null
dav1d 1.4.0-83-g872e470 - by VideoLAN
Decoded 8929/8929 frames (100.0%) - 181.09/23.98 fps (7.55x)
Command being timed: "tools/dav1d -i /home/negge/Videos/Chimera/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null"
User time (seconds): 606.34
System time (seconds): 43.91
Percent of CPU this job got: 1316%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:49.41
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 201076
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 162896
Voluntary context switches: 2333840
Involuntary context switches: 1822071
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
negge@arm1:~/git/rav1d# /usr/bin/time -v target/aarch64-unknown-linux-gnu/release/dav1d -i ~/Videos/Chimera/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null
dav1d 966d63c1 - by VideoLAN
Decoded 8929/8929 frames (100.0%) - 170.58/23.98 fps (7.11x)
Command being timed: "target/aarch64-unknown-linux-gnu/release/dav1d -i /home/negge/Videos/Chimera/Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null"
User time (seconds): 641.91
System time (seconds): 51.00
Percent of CPU this job got: 1320%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:52.47
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 246724
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 4
Minor (reclaiming a frame) page faults: 232651
Voluntary context switches: 2243979
Involuntary context switches: 1968275
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
negge@arm1:~/git/dav1d.jeffv/build# /usr/bin/time -v tools/dav1d -i ~/Videos/Chimera/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null
dav1d 1.4.0-83-g872e470 - by VideoLAN
Decoded 8929/8929 frames (100.0%) - 114.57/23.98 fps (4.78x)
Command being timed: "tools/dav1d -i /home/negge/Videos/Chimera/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null"
User time (seconds): 1002.20
System time (seconds): 60.66
Percent of CPU this job got: 1355%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:18.39
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 306708
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 3
Minor (reclaiming a frame) page faults: 374204
Voluntary context switches: 2633828
Involuntary context switches: 2819241
Swaps: 0
File system inputs: 562920
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
negge@arm1:~/git/rav1d# /usr/bin/time -v target/aarch64-unknown-linux-gnu/release/dav1d -i ~/Videos/Chimera/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null
dav1d 966d63c1 - by VideoLAN
Decoded 8929/8929 frames (100.0%) - 42.27/23.98 fps (1.76x)
Command being timed: "target/aarch64-unknown-linux-gnu/release/dav1d -i /home/negge/Videos/Chimera/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null"
User time (seconds): 1055.09
System time (seconds): 63.88
Percent of CPU this job got: 529%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:31.40
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 327140
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 523375
Voluntary context switches: 3074044
Involuntary context switches: 2714008
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
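The table figures can be pulled out of these dumps mechanically rather than by hand. The following is a rough sketch, assuming the `label: value` layout that `/usr/bin/time -v` prints above; `parse_time_v` and `elapsed_seconds` are hypothetical helper names, not part of any tool.

```python
def parse_time_v(text):
    """Map each 'label: value' line of `time -v` output to its value string."""
    fields = {}
    for line in text.splitlines():
        # Split on the last ": " so labels that contain colons, such as
        # "Elapsed (wall clock) time (h:mm:ss or m:ss)", stay intact.
        # Caveat: a timed command containing ": " would be split incorrectly.
        key, sep, value = line.strip().rpartition(": ")
        if sep:
            fields[key] = value
    return fields


def elapsed_seconds(value):
    """Convert 'h:mm:ss' or 'm:ss.ff' into seconds."""
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Excerpt of the 8-bit dav1d run shown above.
sample = """\
User time (seconds): 606.34
System time (seconds): 43.91
Percent of CPU this job got: 1316%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:49.41
Maximum resident set size (kbytes): 201076
"""

fields = parse_time_v(sample)
user_s = float(fields["User time (seconds)"])
rss_kb = int(fields["Maximum resident set size (kbytes)"])
wall_s = elapsed_seconds(fields["Elapsed (wall clock) time (h:mm:ss or m:ss)"])
print(user_s, rss_kb, wall_s)
```

Run over all four dumps, the same parser also exposes fields that the summary table leaves out, such as elapsed wall-clock time and percent of CPU.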
@negge Thank you for sharing. Indeed, there is currently a performance gap between `dav1d` and `rav1d`. We haven't really spent much time trying to close it, but I've been tracking performance to make sure the gap doesn't grow larger.
I ran a quick profiling test with `summer_nature_1080p`. There are clearly some Rust functions that are slower than their C equivalents.
dav1d 1.4.0
msac_decode_symbol_adapt4_neon (in libdav1d.7.dylib) 4176
decode_coefs (in libdav1d.7.dylib) 2995
prep_8tap_neon (in libdav1d.7.dylib) 2019
decode_b (in libdav1d.7.dylib) 1092
dav1d_refmvs_find (in libdav1d.7.dylib) 879
load_tmvs_c (in libdav1d.7.dylib) 862
add_temporal_candidate (in libdav1d.7.dylib) 667
mc (in libdav1d.7.dylib) 635
dav1d_recon_b_inter_8bpc (in libdav1d.7.dylib) 633
put_8tap_neon (in libdav1d.7.dylib) 610
add_spatial_candidate (in libdav1d.7.dylib) 561
prep_neon (in libdav1d.7.dylib) 471
dav1d_create_lf_mask_inter (in libdav1d.7.dylib) 394
wiener_filter7_hv_8bpc_neon (in libdav1d.7.dylib) 324
msac_decode_bool_adapt_neon (in libdav1d.7.dylib) 275
avg_8bpc_neon (in libdav1d.7.dylib) 229
decode_sb (in libdav1d.7.dylib) 224
wiener_filter5_hv_8bpc_neon (in libdav1d.7.dylib) 215
cdef_filter8_sec_edged_8bpc_neon (in libdav1d.7.dylib) 205
msac_decode_hi_tok_neon (in libdav1d.7.dylib) 193
rav1d a46bb72f
msac_decode_symbol_adapt4_neon (in dav1d) 4057
rav1d::src::recon::decode_coefs::h81e1bc840e33f180 (in dav1d) 2899
prep_8tap_neon (in dav1d) 2079
rav1d::src::decode::decode_b_inner::h9fd83b148c970197 (in dav1d) 1850
rav1d::src::refmvs::add_temporal_candidate::h01ce0e51e98e92ff (in dav1d) 935
rav1d::src::refmvs::load_tmvs_c::ha11d4ff5a82bb433 (in dav1d) 933
rav1d::src::refmvs::rav1d_refmvs_find::hc47eb9832700db67 (in dav1d) 761
rav1d::src::refmvs::add_spatial_candidate::h2dd8a7df8b9924f2 (in dav1d) 729
rav1d::src::recon::mc::h2ba26a7da1b07206 (in dav1d) 711
rav1d::src::recon::rav1d_recon_b_inter::h54eca88c13ef1aa0 (in dav1d) 629
_platform_memset (in libsystem_platform.dylib) 625
put_8tap_neon (in dav1d) 566
prep_neon (in dav1d) 433
wiener_filter7_hv_8bpc_neon (in dav1d) 358
rav1d::src::decode::decode_sb::h2372da60409d4a19 (in dav1d) 293
cdef_filter8_sec_edged_8bpc_neon (in dav1d) 268
msac_decode_bool_adapt_neon (in dav1d) 260
rav1d::src::refmvs::scan_row::h5d3369bc5f56f722 (in dav1d) 206
wiener_filter5_hv_8bpc_neon (in dav1d) 197
rav1d::src::recon::rav1d_recon_b_intra::h050feaff11ff5100 (in dav1d) 194
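To line the two listings up side by side, a small script along these lines may help. It is only a sketch: it assumes each profile line has the form `symbol (in image) samples`, that stripping the Rust module path, the `::h<hash>` suffix, and a leading `dav1d_`/`rav1d_` prefix is enough to match names, and that both runs collected a comparable number of total samples (which is not stated above). The helper names are made up for illustration, and only a few lines from each listing are included.

```python
import re

# One profile line: "<symbol> (in <image>) <sample count>"
LINE = re.compile(r"^(\S+) \(in [^)]+\) (\d+)$")


def normalize(symbol):
    """Reduce a C or mangled Rust symbol to a comparable short name."""
    symbol = re.sub(r"::h[0-9a-f]+$", "", symbol)    # drop Rust hash suffix
    symbol = symbol.rsplit("::", 1)[-1]              # keep last path segment
    return re.sub(r"^(dav1d|rav1d)_", "", symbol)    # drop project prefix


def parse_profile(text):
    counts = {}
    for line in text.strip().splitlines():
        match = LINE.match(line.strip())
        if match:
            counts[normalize(match.group(1))] = int(match.group(2))
    return counts


dav1d_profile = """
decode_coefs (in libdav1d.7.dylib) 2995
dav1d_refmvs_find (in libdav1d.7.dylib) 879
load_tmvs_c (in libdav1d.7.dylib) 862
add_temporal_candidate (in libdav1d.7.dylib) 667
add_spatial_candidate (in libdav1d.7.dylib) 561
"""

rav1d_profile = """
rav1d::src::recon::decode_coefs::h81e1bc840e33f180 (in dav1d) 2899
rav1d::src::refmvs::add_temporal_candidate::h01ce0e51e98e92ff (in dav1d) 935
rav1d::src::refmvs::load_tmvs_c::ha11d4ff5a82bb433 (in dav1d) 933
rav1d::src::refmvs::rav1d_refmvs_find::hc47eb9832700db67 (in dav1d) 761
rav1d::src::refmvs::add_spatial_candidate::h2dd8a7df8b9924f2 (in dav1d) 729
"""

c_counts = parse_profile(dav1d_profile)
rust_counts = parse_profile(rav1d_profile)

# Positive difference means more samples in the rav1d run.
diffs = {
    name: rust_counts[name] - c_counts[name]
    for name in c_counts.keys() & rust_counts.keys()
}
for name, diff in sorted(diffs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24} {c_counts[name]:6} {rust_counts[name]:6} {diff:+6}")
```

Functions whose names differ between the two code bases (for example `decode_b` versus `decode_b_inner`, or `dav1d_recon_b_inter_8bpc` versus `rav1d_recon_b_inter`) are not matched by this normalization and would need an explicit alias table.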
Hi @negge, thanks for benchmarking this! As @fbossen said, we haven't had much time to look at performance yet, as we've been focused primarily on making everything safe first (while trying not to introduce any performance regressions). Once that's accomplished, we'll turn to closing the performance gap.