`_mm256_loadu_si256` is failed to be inlined for ABI issues
usamoi opened this issue · 4 comments
Inspired by rust-lang/rust#121960, I'm looking for SIMD intrinsics that are not inlined in generated code.
https://github.com/BurntSushi/aho-corasick/blob/master/src/packed/vector.rs#L19C1-L27C59:
/// # Safety
///
/// All methods are not safe since they are intended to be implemented using
/// vendor intrinsics, which are also not safe. Callers must ensure that
/// the appropriate target features are enabled in the calling function,
/// and that the current CPU supports them. All implementations should
/// avoid marking the routines with `#[target_feature]` and instead mark
/// them as `#[inline(always)]` to ensure they get appropriately inlined.
/// (`inline(always)` cannot be used with target_feature.)
It's not fully true: if you do not mark the routines with #[target_feature]
, LLVM will reject to inline them since it does not know if inlining causes ABI issues. So we need to use both #[target_feature]
and #[inline(always)]
.
I find _mm256_loadu_si256
is failed to be inlined in my project and it also applies to the released cargo
binary. I think it's another rustc bug at first but finally objdump
leads me here.
Step to reproduce it:
Copy & paste the example in readme.
objdump ./target/release/play_rust -D --demangle | grep "core_arch"
e0f1: e8 7a fd 06 00 call 7de70 <core::core_arch::x86::xsave::_xgetbv>
0000000000029f70 <core::ptr::drop_in_place<&aho_corasick::packed::teddy::generic::Mask<core::core_arch::x86::__m128i>>>:
000000000002eba0 <core::ptr::drop_in_place<core::core_arch::x86::__m128i>>:
000000000002ebb0 <core::ptr::drop_in_place<core::core_arch::x86::__m256i>>:
000000000002f0d0 <<core::core_arch::x86::__m128i as core::fmt::Debug>::fmt>:
000000000002f120 <<core::core_arch::x86::__m256i as core::fmt::Debug>::fmt>:
0000000000031810 <core::ptr::drop_in_place<aho_corasick::packed::teddy::generic::Slim<core::core_arch::x86::__m128i,1_usize>>>:
0000000000031820 <core::ptr::drop_in_place<aho_corasick::packed::teddy::generic::Slim<core::core_arch::x86::__m128i,2_usize>>>:
0000000000031830 <core::ptr::drop_in_place<aho_corasick::packed::teddy::generic::Slim<core::core_arch::x86::__m128i,3_usize>>>:
0000000000031840 <core::ptr::drop_in_place<aho_corasick::packed::teddy::generic::Slim<core::core_arch::x86::__m128i,4_usize>>>:
48651: e8 7a 06 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
48678: e8 53 06 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
486c6: e8 05 06 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
486ea: e8 e1 05 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
48744: e8 87 05 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
48762: e8 69 05 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
48861: e8 6a 04 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
48888: e8 43 04 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
488d4: e8 f7 03 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
488f2: e8 d9 03 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
489a5: e8 26 03 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
489c3: e8 08 03 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
48a51: e8 7a 02 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
48a78: e8 53 02 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
48ac6: e8 05 02 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
48aea: e8 e1 01 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
48b44: e8 87 01 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
48b68: e8 63 01 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
48bc2: e8 09 01 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
48be0: e8 eb 00 00 00 call 48cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>
0000000000048cd0 <core::core_arch::x86::avx::_mm256_loadu_si256>:
000000000007de70 <core::core_arch::x86::xsave::_xgetbv>:
It's not fully true: if you do not mark the routines with
#[target_feature]
, LLVM will reject to inline them since it does not know if inlining causes ABI issues. So we need to use both#[target_feature]
and#[inline(always)]
.
What part isn't "fully" true? You conclude by saying that both #[target_feature]
and #[inline(always)]
should be used, but that not only directly contradicts the comment you quoted, rustc
will reject it.
LLVM will reject to inline them since it does not know if inlining causes ABI issues
It does though. Because the caller of these functions has to uphold the safety contract that it's calling them from within a context where the appropriate target features are enabled.
Copy & paste the example in readme.
This is not a reproduction. You didn't include the specific steps you went through to produce the executable. When you give repro steps, you should provide every step with the commands you're running. Like this:
$ mkdir -p i140
$ cat Cargo.toml
[package]
publish = false
name = "i140"
version = "0.1.0"
edition = "2021"
[dependencies]
aho-corasick = "1.1.3"
anyhow = "1.0.81"
[[bin]]
name = "i140"
path = "main.rs"
$ cat main.rs
use aho_corasick::{AhoCorasick, PatternID};
fn main() {
let patterns = &["apple", "maple", "Snapple"];
let haystack = "Nobody likes maple in their apple flavored Snapple.";
let ac = AhoCorasick::new(patterns).unwrap();
let mut matches = vec![];
for mat in ac.find_iter(haystack) {
matches.push((mat.pattern(), mat.start(), mat.end()));
}
assert_eq!(
matches,
vec![
(PatternID::must(1), 13, 18),
(PatternID::must(0), 28, 33),
(PatternID::must(2), 43, 50),
]
);
}
$ cargo build --release
Compiling anyhow v1.0.81
Compiling memchr v2.7.1
Compiling aho-corasick v1.1.3
Compiling i140 v0.1.0 (/home/andrew/tmp/issues/aho-corasick/i140)
Finished `release` profile [optimized] target(s) in 1.86s
Okay, and now I can try running your command:
$ objdump ./target/release/i140 -D --demangle | grep "core_arch"
e0f1: e8 7a fd 06 00 call 7de70 <core::core_arch::x86::xsave::_xgetbv>
0000000000029f70 <core::ptr::drop_in_place<&aho_corasick::packed::teddy::generic::Mask<core::core_arch::x86::__m128i>>>:
000000000002eba0 <core::ptr::drop_in_place<core::core_arch::x86::__m128i>>:
000000000002ebb0 <core::ptr::drop_in_place<core::core_arch::x86::__m256i>>:
000000000002f0d0 <<core::core_arch::x86::__m128i as core::fmt::Debug>::fmt>:
000000000002f120 <<core::core_arch::x86::__m256i as core::fmt::Debug>::fmt>:
31541: e8 7a 06 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
31568: e8 53 06 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
315b4: e8 07 06 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
315d2: e8 e9 05 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
31685: e8 36 05 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
316a3: e8 18 05 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
31731: e8 8a 04 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
31758: e8 63 04 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
317a6: e8 15 04 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
317ca: e8 f1 03 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
31824: e8 97 03 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
31842: e8 79 03 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
31941: e8 7a 02 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
31968: e8 53 02 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
319b6: e8 05 02 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
319da: e8 e1 01 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
31a34: e8 87 01 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
31a58: e8 63 01 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
31ab2: e8 09 01 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
31ad0: e8 eb 00 00 00 call 31bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>
0000000000031bc0 <core::core_arch::x86::avx::_mm256_loadu_si256>:
0000000000031ef0 <core::ptr::drop_in_place<aho_corasick::packed::teddy::generic::Slim<core::core_arch::x86::__m128i,1_usize>>>:
0000000000031f00 <core::ptr::drop_in_place<aho_corasick::packed::teddy::generic::Slim<core::core_arch::x86::__m128i,2_usize>>>:
0000000000031f10 <core::ptr::drop_in_place<aho_corasick::packed::teddy::generic::Slim<core::core_arch::x86::__m128i,3_usize>>>:
0000000000031f20 <core::ptr::drop_in_place<aho_corasick::packed::teddy::generic::Slim<core::core_arch::x86::__m128i,4_usize>>>:
000000000007de70 <core::core_arch::x86::xsave::_xgetbv>:
I don't really know what to make of this. It does seem to suggest that it isn't getting inline somewhere, but it doesn't say where. And in particular, if this were a systemic problem, something like "load a vector from memory" not being inlined would be very apparent on a profile. Indeed, in profiling, the routine is marked as inlined:
I got that profile by using perf
on Linux after building and running aho-corasick-debug
(found in the root of this project's repository):
$ cargo install --path aho-corasick-debug
$ aho-corasick-debug patterns quarter.txt --match-kind leftmost-first
You can use whatever kind of haystack you want for this. Just make it a big one. In my case, quarter.txt
is 3GB and is generated from the OpenSubtitles 2018 data set. You can get the full data file here: https://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/mono/OpenSubtitles.raw.en.gz
.
One possibility is that this function (via the generic Vector::load_unaligned
) is being used somewhere where the relevant target feature attributes haven't been properly applied. Perhaps when building the Teddy searcher. In that case, LLVM wouldn't inline it (because of the ABI problem). The function will still work correctly AIUI, but it won't be inlined. And since it would only happen once at construction time, its performance overhead would be non-existent and thus hard to see.
So... what made you report this issue? Are you hitting a real problem? Or are you fishing for one? If the former, please share the real problem you're hitting. If the latter, I'm happy to have this fixed, but it isn't something I'm keen on spending time on.
One possibility is that this function (via the generic Vector::load_unaligned) is being used somewhere where the relevant target feature attributes haven't been properly applied. Perhaps when building the Teddy searcher. In that case, LLVM wouldn't inline it (because of the ABI problem). The function will still work correctly AIUI, but it won't be inlined. And since it would only happen once at construction time, its performance overhead would be non-existent and thus hard to see.
Inlining fails on https://github.com/BurntSushi/aho-corasick/blob/master/src/packed/teddy/generic.rs#L1254. Since you think it's not a problem, I'm closing this issue.
So... what made you report this issue? Are you hitting a
real
problem? Or are you fishing for one?
I think I'm just reporting a potential problem since I'm not a user of this package. If you call it fishing, I'm fishing.
To be clear, fishing is fine. It's just important context to understand what is driving things here.
I'll re-open this. I didn't mean to say it wasn't a problem, just not one that I'll like prioritize myself to fix.
Inlining fails on https://github.com/BurntSushi/aho-corasick/blob/master/src/packed/teddy/generic.rs#L1254.
Yeah indeed. It looks like going through array.map(...)
fouls things up here. Sigh.