memorysafety/rav1d

Tracking issue for performance


This issue is intended to aggregate results, track progress, and discuss performance optimization for rav1d.

Unless otherwise noted, the following conditions apply to these measurements:

- All C compilation is done with Clang 18, to match the LLVM backend used by current rustc
- Test input is 8-bit Chimera
- `--threads 8`
| CPU | Test | Time (s) |
| --- | --- | --- |
| 7700X | dav1d 2355eeb | 5.286 |
| 7700X | rav1d 412cd4c | 5.766 (9%) |
| 7700X | dav1d 2355eeb 10-bit | 13.538 |
| 7700X | rav1d 412cd4c 10-bit | 14.32 (5.8%) |
| i7-1260p | dav1d 2355eeb | 16.147 |
| i7-1260p | rav1d 412cd4c | 17.287 (7%) |
| i7-12700K | dav1d 2355eeb | 6.663 |
| i7-12700K | rav1d 412cd4c | 7.075 (6%) |
| M2 MacBook (AArch64) | dav1d 2355eeb | 8.958 |
| M2 MacBook (AArch64) | dav1d w/out backports 412cd4c | 9.106 |
| M2 MacBook (AArch64) | rav1d 412cd4c | 9.818 (9.6% vs. upstream, 7.8% vs. w/out some backports) ** |
| Pixel 8 (Tensor G3) | dav1d 2355eeb | 34.529 |
| Pixel 8 (Tensor G3) | rav1d 412cd4c | 38.504 (11.5%) *** |

** Some AArch64-relevant backports are not yet complete.
*** Using NDK 27 RC 1; used `hyperfine --warmup 3 ...` to lower variance.

Latest results (raw times are a bit faster across the board because I'm benchmarking in a quieter OS environment; I re-did the baselines for consistency):

8-bit Chimera:

| CPU | Test | Time (s) |
| --- | --- | --- |
| 7700X | dav1d 2355eeb | 5.148 |
| 7700X | dav1d 2355eeb (full LTO w/ LLD) | 5.172 ** |
| 7700X | rav1d b80f922 | 5.572 (8.2%) |
| 7700X | rav1d #1320 | 5.492 (6.7%) |

10-bit Chimera:

| CPU | Test | Time (s) |
| --- | --- | --- |
| 7700X | dav1d 2355eeb 10-bit | 13.204 |
| 7700X | rav1d b80f922 10-bit | 14.035 (6.3%) |
| 7700X | rav1d #1320 | 13.995 (6.0%) |

** Full LTO for the C code seems to make performance slightly worse, if anything. I'm surprised by this, but the measurements are consistent on my machine.

Latest results:

8-bit Chimera:

| CPU | Test | Time (s) |
| --- | --- | --- |
| 7700X | dav1d 2355eeb | 5.148 |
| 7700X | rav1d main 74f485b | 5.500 (6.8%) |
| 7700X | rav1d #1325 | 5.436 (5.6%) |

10-bit Chimera:

| CPU | Test | Time (s) |
| --- | --- | --- |
| 7700X | dav1d 2355eeb | 13.204 |
| 7700X | rav1d main 74f485b | 13.894 (5.2%) |
| 7700X | rav1d #1325 | 13.895 (5.2%) |

AArch64 results after backporting #1300:

8-bit Chimera:

| CPU | Test | Time (s) |
| --- | --- | --- |
| M2 | dav1d 2355eeb | 8.956 |
| M2 | rav1d main b26781a | 9.625 (7.5%) |

10-bit Chimera:

| CPU | Test | Time (s) |
| --- | --- | --- |
| M2 | dav1d 2355eeb | 28.23 |
| M2 | rav1d main b26781a | 29.529 (4.6%) |

I've been looking into what the remaining sources of overhead might be, and wanted to chime in with some of my findings.

One major discrepancy I noticed when running benchmarks was an order-of-magnitude difference between dav1d and rav1d in the number of madvise calls and page faults. I also saw more context switches in rav1d than in dav1d (likely related to the page faults?). This may explain at least some of the performance difference seen.

Digging into this, I found that the majority (~82%) of the page faults in rav1d come from rav1d::src::decode::rav1d_submit_frame. Specifically, the stack trace points to:

```
<alloc::boxed::Box<[rav1d::src::refmvs::RefMvsTemporalBlock]> as core::iter::traits::collect::FromIterator<rav1d::src::refmvs::RefMvsTemporalBlock>>::from_iter::<core::iter::adapters::map::Map<core::ops::range::Range<usize>, rav1d::src::decode::rav1d_submit_frame::{closure#1}>>
```

I believe that corresponds to this closure:

rav1d/src/decode.rs, lines 5223 to 5227 at 7d72409:

```rust
f.mvs = Some(
    (0..f.sb128h as usize * 16 * (f.b4_stride >> 1) as usize)
        .map(|_| Default::default())
        .collect(),
);
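
To make the cost concrete, here's a minimal standalone sketch of the same pattern (with a hypothetical `Block` type standing in for `RefMvsTemporalBlock`): each frame allocates a fresh boxed slice and default-initializes every element, so the first touch of each newly mapped page takes a soft page fault.

```rust
// Hypothetical stand-in for RefMvsTemporalBlock; only the allocation
// pattern matters here, not the field layout.
#[derive(Clone, Copy, Default)]
struct Block([u8; 12]);

// Mirrors the rav1d code above: a fresh boxed slice is allocated and
// every element is written on each call.
fn alloc_mvs(n: usize) -> Box<[Block]> {
    (0..n).map(|_| Block::default()).collect()
}

fn main() {
    // Simulating per-frame allocation: each iteration allocates ~3 MiB
    // anew instead of reusing the previous frame's buffer, so the kernel
    // has to map in (and the decoder fault in) fresh pages every time.
    for _frame in 0..10 {
        let mvs = alloc_mvs(256 * 1024);
        std::hint::black_box(&mvs);
    }
}
```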

This is the equivalent operation in dav1d:

rav1d/src/decode.c, lines 3623 to 3624 at 7d72409:

```c
f->mvs_ref = dav1d_ref_create_using_pool(c->refmvs_pool,
    sizeof(*f->mvs) * f->sb128h * 16 * (f->b4_stride >> 1));
```

Here dav1d_submit_frame allocates via dav1d_ref_create_using_pool, which calls into dav1d_mem_pool_pop and draws from pooled memory (initialized in dav1d_mem_pool_init). This likely reduces the number of allocator calls, and with them the madvise traffic and page faults that come from repeatedly mapping and touching fresh memory.
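
For anyone unfamiliar with the pool: the idea is just a free list of retired buffers that later frames reuse, so the backing pages stay mapped (and already faulted in) rather than bouncing through the allocator on every frame. Here's a rough sketch of that idea in Rust; this is my own illustration and doesn't mirror dav1d's actual layout:

```rust
use std::sync::Mutex;

// A thread-safe free list of retired buffers.
struct MemPool<T> {
    free: Mutex<Vec<Box<[T]>>>,
}

impl<T: Default + Clone> MemPool<T> {
    fn new() -> Self {
        MemPool { free: Mutex::new(Vec::new()) }
    }

    // Roughly analogous to dav1d_mem_pool_pop: reuse a retired buffer of
    // the right size when one is available, and fall back to a fresh
    // allocation only when the pool has none.
    fn pop(&self, len: usize) -> Box<[T]> {
        let mut free = self.free.lock().unwrap();
        if let Some(i) = free.iter().position(|b| b.len() == len) {
            return free.swap_remove(i);
        }
        drop(free); // don't hold the lock while allocating
        vec![T::default(); len].into_boxed_slice()
    }

    // Retire a buffer so a later frame can reuse its pages.
    fn push(&self, buf: Box<[T]>) {
        self.free.lock().unwrap().push(buf);
    }
}
```

The win is presumably less the malloc call itself than avoiding the kernel-side work (madvise, unmapping, and re-faulting zeroed pages) when large buffers are repeatedly freed and reallocated.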

The switch away from pooled memory in rav1d appears to have been introduced in 6420e5a, as part of PR #984.

@ivanloz, thanks for finding this! That's definitely something we changed. I had thought we hadn't seen a performance impact outside of the pooled picture allocator, which we kept pooled, but maybe we missed it, or it behaves differently on different systems.

Could you put your comment above in its own issue? We'll work on fixing it. It's tricky due to the lifetimes involved (the picture pool got around this because it already has to go through an unsafe C API for {D,R}av1dPicAllocator), but if it's affecting performance, we'll figure out how to support it.
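
To sketch the rough shape that might work (purely hypothetical, not a settled design): the pool hands out an owning guard that carries an Arc back to the pool and returns its buffer on Drop, so nothing borrowed from the pool has to be stored in long-lived frame state.

```rust
use std::sync::{Arc, Mutex};

struct Pool {
    free: Mutex<Vec<Vec<u8>>>,
}

// Owning guard: holds its buffer plus a reference-counted handle to the
// pool, so dropping it anywhere returns the storage for reuse.
struct PooledBuf {
    buf: Option<Vec<u8>>,
    pool: Arc<Pool>,
}

impl Drop for PooledBuf {
    fn drop(&mut self) {
        if let Some(buf) = self.buf.take() {
            self.pool.free.lock().unwrap().push(buf);
        }
    }
}

impl Pool {
    // Pop a buffer of at least `len` bytes, reusing retired storage when
    // possible; the guard owns the buffer outright, so no pool lifetime
    // leaks into the caller.
    fn pop(pool: &Arc<Pool>, len: usize) -> PooledBuf {
        let mut buf = {
            let mut free = pool.free.lock().unwrap();
            match free.iter().position(|b| b.capacity() >= len) {
                Some(i) => free.swap_remove(i),
                None => Vec::new(),
            }
        };
        buf.clear();
        buf.resize(len, 0);
        PooledBuf { buf: Some(buf), pool: Arc::clone(pool) }
    }
}
```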

Done -- see #1358, thanks!