Tracking issue for performance
This issue is intended to aggregate results, track progress, and discuss performance optimization for rav1d.
Unless otherwise noted, the following conditions apply to these measurements:
- all C compilation is done with Clang 18 to match the LLVM backend used by current rustc
- Test input is 8-bit Chimera
- `--threads 8`
CPU | Test | Time (s) |
---|---|---|
7700X | dav1d 2355eeb | 5.286 |
7700X | rav1d 412cd4c | 5.766 (9%) |
7700X | dav1d 2355eeb 10-bit | 13.538 |
7700X | rav1d 412cd4c 10-bit | 14.32 (5.8%) |
i7-1260p | dav1d 2355eeb | 16.147 |
i7-1260p | rav1d 412cd4c | 17.287 (7%) |
i7-12700K | dav1d 2355eeb | 6.663 |
i7-12700K | rav1d 412cd4c | 7.075 (6%) |
M2 MacBook (AArch64) | dav1d 2355eeb | 8.958 |
M2 MacBook (AArch64) | dav1d w/out backports 412cd4c | 9.106 |
M2 MacBook (AArch64) | rav1d 412cd4c | 9.818 (9.6% upstream, 7.8% w/out some backports) ** |
Pixel 8 (Tensor G3) | dav1d 2355eeb | 34.529 |
Pixel 8 (Tensor G3) | rav1d 412cd4c | 38.504 (11.5%)*** |
** Some AArch64 relevant backports are not yet completed
*** Using NDK 27 RC 1; used hyperfine --warmup 3 ... to lower variance
Latest results (raw times are a bit faster across the board because I'm benchmarking in a quieter OS environment; I re-did the baselines for consistency):
8-bit Chimera:
CPU | Test | Time (s) |
---|---|---|
7700X | dav1d 2355eeb | 5.148 |
7700X | dav1d 2355eeb (full LTO w/ LLD) | 5.172 ** |
7700X | rav1d b80f922 | 5.572 (8.2%) |
7700X | rav1d #1320 | 5.492 (6.7%) |
10-bit Chimera:
CPU | Test | Time (s) |
---|---|---|
7700X | dav1d 2355eeb 10-bit | 13.204 |
7700X | rav1d b80f922 10-bit | 14.035 (6.3%) |
7700X | rav1d #1320 | 13.995 (6.0%) |
** Full LTO for the C code seems to make performance slightly worse, if anything. I'm surprised by this but the measurements are consistent on my machine.
I've been looking into what remaining sources of overhead might be and wanted to chime in with some of my findings.
One major discrepancy I noticed when running benchmarks between dav1d and rav1d was an order-of-magnitude difference in the number of madvise calls and page faults. I also saw more context switches in rav1d than in dav1d (likely related to the page faults?). This may explain at least some of the performance difference seen.
Digging into this, I found that the majority (~82%) of the page faults in rav1d come from rav1d::src::decode::rav1d_submit_frame. Specifically, the stack trace points to:
<alloc::boxed::Box<[rav1d::src::refmvs::RefMvsTemporalBlock]> as core::iter::traits::collect::FromIterator<rav1d::src::refmvs::RefMvsTemporalBlock>>::from_iter::<core::iter::adapters::map::Map<core::ops::range::Range<usize>, rav1d::src::decode::rav1d_submit_frame::{closure#1}>>
I believe that corresponds to this closure (lines 5223 to 5227 in 7d72409).
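For illustration, here is a minimal sketch (not the actual rav1d code; the struct layout, field names, and block count are assumptions) of the allocation pattern that symbol suggests: a fresh Box<[RefMvsTemporalBlock]> is collected from a mapped range on every frame, so each frame goes through the global allocator and touches newly mapped pages.

```rust
// Stand-in for rav1d::src::refmvs::RefMvsTemporalBlock; the real layout differs.
#[derive(Clone, Copy, Default)]
struct RefMvsTemporalBlock {
    mv: [i16; 2],
    r#ref: i8,
}

fn main() {
    // Assumed per-frame block count; the real value depends on frame dimensions.
    let n_blocks = 64 * 1024;
    // Collecting a mapped Range into Box<[T]> allocates a new buffer every
    // frame through the global allocator, which is where the page faults and
    // madvise calls show up in the profile.
    let blocks: Box<[RefMvsTemporalBlock]> = (0..n_blocks)
        .map(|_| RefMvsTemporalBlock::default())
        .collect();
    assert_eq!(blocks.len(), n_blocks);
}
```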
This is the equivalent operation in dav1d (lines 3623 to 3624 in 7d72409).
Here dav1d_submit_frame allocates using dav1d_ref_create_using_pool, which calls into dav1d_mem_pool_pop and allocates from pooled memory (initialized in dav1d_mem_pool_init). This likely reduces the number of allocator calls.
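To make the contrast concrete, here is a minimal sketch of the pooling idea in Rust (this is not the dav1d API or a proposed rav1d design; the names and the free-list policy are simplified): buffers are recycled through a shared free list, so steady-state decoding stops going through the allocator and keeps reusing already-faulted-in pages.

```rust
use std::sync::Mutex;

// Simplified illustration of the pooling pattern behind dav1d_mem_pool_pop;
// not the real dav1d data structures.
struct MemPool {
    free: Mutex<Vec<Box<[u8]>>>,
}

impl MemPool {
    fn new() -> Self {
        MemPool { free: Mutex::new(Vec::new()) }
    }

    // Reuse a buffer of at least `size` bytes if one is available,
    // otherwise fall back to a fresh allocation.
    fn pop(&self, size: usize) -> Box<[u8]> {
        let mut free = self.free.lock().unwrap();
        match free.iter().position(|b| b.len() >= size) {
            Some(i) => free.swap_remove(i),
            None => vec![0u8; size].into_boxed_slice(),
        }
    }

    // Return a buffer to the pool so a later frame can reuse it.
    fn push(&self, buf: Box<[u8]>) {
        self.free.lock().unwrap().push(buf);
    }
}

fn main() {
    let pool = MemPool::new();
    // The first pop allocates; pushing it back and popping again reuses
    // the same warm buffer instead of faulting in new pages.
    let buf = pool.pop(1 << 20);
    pool.push(buf);
    let _reused = pool.pop(1 << 20);
}
```

The first few frames still pay for the initial allocations, but subsequent frames reuse warm buffers, which is consistent with the lower madvise/page-fault counts observed for dav1d.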
The switch away from pooled memory in rav1d looks to have been introduced as part of 6420e5a (PR #984).
@ivanloz, thanks for finding this! That's definitely something we changed, and I had thought we hadn't seen a performance impact outside of the picture allocator (which we kept pooled), but maybe we missed it or it behaves differently on different systems.
Could you put your comment above in its own issue? We'll work on fixing it. It is tricky due to the lifetimes involved (the picture pool got around this because it already has to go through an unsafe C API for {D,R}av1dPicAllocator), but if it's affecting performance, we'll figure out how to support it.