Tracking issue for performance
This issue is intended to aggregate results, track progress, and discuss performance optimization for rav1d.
Unless otherwise noted, the following conditions apply to these measurements:
- all C compilation is done with Clang 18 to match the LLVM backend used by current rustc
- Test input is 8-bit Chimera
- `--threads 8`
CPU | Test | Time (s) |
---|---|---|
7700X | dav1d 2355eeb | 5.286 |
7700X | rav1d 412cd4c | 5.766 (9%) |
7700X | dav1d 2355eeb 10-bit | 13.538 |
7700X | rav1d 412cd4c 10-bit | 14.32 (5.8%) |
i7-1260p | dav1d 2355eeb | 16.147 |
i7-1260p | rav1d 412cd4c | 17.287 (7%) |
i7-12700K | dav1d 2355eeb | 6.663 |
i7-12700K | rav1d 412cd4c | 7.075 (6%) |
M2 MacBook (AArch64) | dav1d 2355eeb | 8.958 |
M2 MacBook (AArch64) | dav1d w/out backports 412cd4c | 9.106 |
M2 MacBook (AArch64) | rav1d 412cd4c | 9.818 (9.6% upstream, 7.8% w/out some backports) ** |
Pixel 8 (Tensor G3) | dav1d 2355eeb | 34.529 |
Pixel 8 (Tensor G3) | rav1d 412cd4c | 38.504 (11.5%)*** |
** Some AArch64 relevant backports are not yet completed
*** Using NDK 27 RC 1; used hyperfine --warmup 3 ... to lower variance
Latest results (raw times are a bit faster across the board because I'm benchmarking in a quieter OS environment; I re-did the baselines for consistency):
8-bit Chimera:
CPU | Test | Time (s) |
---|---|---|
7700X | dav1d 2355eeb | 5.148 |
7700X | dav1d 2355eeb (full LTO w/ LLD) | 5.172 ** |
7700X | rav1d b80f922 | 5.572 (8.2%) |
7700X | rav1d #1320 | 5.492 (6.7%) |
10-bit Chimera:
CPU | Test | Time (s) |
---|---|---|
7700X | dav1d 2355eeb 10-bit | 13.204 |
7700X | rav1d b80f922 10-bit | 14.035 (6.3%) |
7700X | rav1d #1320 | 13.995 (6.0%) |
** Full LTO for the C code seems to make performance slightly worse, if anything. I'm surprised by this but the measurements are consistent on my machine.
I've been looking into what remaining sources of overhead might be and wanted to chime in with some of my findings.
One major discrepancy I noticed when running benchmarks between dav1d and rav1d was an order-of-magnitude difference in the number of madvise calls and page faults. I also saw more context switches in rav1d than in dav1d (likely related to the page faults?). This may explain at least some of the performance difference seen.
Digging into this, I found that the majority (~82%) of the page faults in rav1d come from rav1d::src::decode::rav1d_submit_frame. Specifically, the stack trace points to:
<alloc::boxed::Box<[rav1d::src::refmvs::RefMvsTemporalBlock]> as core::iter::traits::collect::FromIterator<rav1d::src::refmvs::RefMvsTemporalBlock>>::from_iter::<core::iter::adapters::map::Map<core::ops::range::Range<usize>, rav1d::src::decode::rav1d_submit_frame::{closure#1}>>
I believe that corresponds to this closure (lines 5223 to 5227 in 7d72409).
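For illustration, here is a minimal sketch (not the actual rav1d code; the struct layout, field names, and block count are assumptions) of the allocation pattern that symbol suggests: a fresh Box<[RefMvsTemporalBlock]> is collected from a mapped range on every frame, so each frame goes through the global allocator and touches newly mapped pages.

```rust
// Stand-in for rav1d::src::refmvs::RefMvsTemporalBlock; the real layout differs.
#[derive(Clone, Copy, Default)]
struct RefMvsTemporalBlock {
    mv: [i16; 2],
    r#ref: i8,
}

fn main() {
    // Assumed per-frame block count; the real value depends on frame dimensions.
    let n_blocks = 64 * 1024;
    // Collecting a mapped Range into Box<[T]> allocates a new buffer every
    // frame through the global allocator, which is where the page faults and
    // madvise calls show up in the profile.
    let blocks: Box<[RefMvsTemporalBlock]> = (0..n_blocks)
        .map(|_| RefMvsTemporalBlock::default())
        .collect();
    assert_eq!(blocks.len(), n_blocks);
}
```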
This is the equivalent operation in dav1d (lines 3623 to 3624 in 7d72409).
Here dav1d_submit_frame allocates using dav1d_ref_create_using_pool, which calls into dav1d_mem_pool_pop and allocates from pooled memory (initialized in dav1d_mem_pool_init). This likely reduces the number of allocator calls.
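To make the contrast concrete, here is a minimal sketch of the pooling idea in Rust (this is not the dav1d API or a proposed rav1d design; the names and the free-list policy are simplified): buffers are recycled through a shared free list, so steady-state decoding stops going through the allocator and keeps reusing already-faulted-in pages.

```rust
use std::sync::Mutex;

// Simplified illustration of the pooling pattern behind dav1d_mem_pool_pop;
// not the real dav1d data structures.
struct MemPool {
    free: Mutex<Vec<Box<[u8]>>>,
}

impl MemPool {
    fn new() -> Self {
        MemPool { free: Mutex::new(Vec::new()) }
    }

    // Reuse a buffer of at least `size` bytes if one is available,
    // otherwise fall back to a fresh allocation.
    fn pop(&self, size: usize) -> Box<[u8]> {
        let mut free = self.free.lock().unwrap();
        match free.iter().position(|b| b.len() >= size) {
            Some(i) => free.swap_remove(i),
            None => vec![0u8; size].into_boxed_slice(),
        }
    }

    // Return a buffer to the pool so a later frame can reuse it.
    fn push(&self, buf: Box<[u8]>) {
        self.free.lock().unwrap().push(buf);
    }
}

fn main() {
    let pool = MemPool::new();
    // The first pop allocates; pushing it back and popping again reuses
    // the same warm buffer instead of faulting in new pages.
    let buf = pool.pop(1 << 20);
    pool.push(buf);
    let _reused = pool.pop(1 << 20);
}
```

The first few frames still pay for the initial allocations, but subsequent frames reuse warm buffers, which is consistent with the lower madvise/page-fault counts observed for dav1d.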
The switch away from pooled memory in rav1d looks to have been introduced as part of 6420e5a (PR #984).
@ivanloz, thanks for finding this! That's definitely something we changed, and I had thought we hadn't seen a performance impact outside of the picture allocator (which we kept pooled), but maybe we missed it or it behaves differently on different systems.
Could you put your comment above in its own issue? We'll work on fixing it. It is tricky due to the lifetimes involved (the picture pool got around this because it already has to go through an unsafe C API for {D,R}av1dPicAllocator), but if it's affecting performance, we'll figure out how to support it.