Binary search performance regressed in 1.25
Emerentius opened this issue · 36 comments
The speedup obtained by the optimizations from #45333, which hit stable in 1.23, was mostly lost again from 1.25 onwards, because the compiler started emitting a branch instead of a cmov.
1.24:
cmp qword ptr [rsi + 8*rax], rcx
cmova rax, r8

1.25:
cmp qword ptr [rsi + 8*r9], rcx
ja .LBB0_7
mov r8, r9
.LBB0_7:

From godbolt: https://godbolt.org/z/DUDl-T
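For context, the hot loop under discussion boils down to a data-dependent select of roughly the following shape (a minimal sketch for godbolt experiments, not the exact libcore source, which is quoted further down the thread):

```rust
// Minimal sketch of the regressed pattern: the `if`/`else` below is a
// branch-free select that LLVM 4 lowered to `cmova`, while LLVM 6's
// x86-cmov-converter pass turns it back into a compare-and-branch.
pub fn lower_bound(slice: &[u64], key: u64) -> usize {
    let mut base = 0usize;
    let mut size = slice.len();
    while size > 1 {
        let half = size / 2;
        let mid = base + half;
        // SAFETY: mid < slice.len(), since base + size <= slice.len() and half < size.
        let probe = unsafe { *slice.get_unchecked(mid) };
        base = if probe > key { base } else { mid };
        size -= half;
    }
    base
}
```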
Bench comparison with 1.23 nightly (2017-11-20, 9 days after PR merge)
| name | old_1.23 ns/iter | new_1.23 ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| binary_search_l1 | 70 | 28 | -42 | -60.00% | x 2.50 |
| binary_search_l1_with_dups | 46 | 28 | -18 | -39.13% | x 1.64 |
| binary_search_l2 | 94 | 42 | -52 | -55.32% | x 2.24 |
| binary_search_l2_with_dups | 68 | 42 | -26 | -38.24% | x 1.62 |
| binary_search_l3 | 267 | 234 | -33 | -12.36% | x 1.14 |
| binary_search_l3_with_dups | 198 | 236 | 38 | 19.19% | x 0.84 |
and status quo with 1.30 nightly (2018-08-15)
| name | old_1.30 ns/iter | new_1.30 ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| binary_search_l1 | 75 | 68 | -7 | -9.33% | x 1.10 |
| binary_search_l1_with_dups | 46 | 61 | 15 | 32.61% | x 0.75 |
| binary_search_l2 | 99 | 89 | -10 | -10.10% | x 1.11 |
| binary_search_l2_with_dups | 71 | 89 | 18 | 25.35% | x 0.80 |
| binary_search_l3 | 272 | 245 | -27 | -9.93% | x 1.11 |
| binary_search_l3_with_dups | 212 | 256 | 44 | 20.75% | x 0.83 |
The likely culprit is the upgrade from LLVM 4 to 6. The upgrade was merged on February 10th; the binary search was still fast in the nightly from the 9th and slow in the one from the 12th.
@estebank The optimization did hit stable (if only for 2 releases), so does it not need the regression-from-stable-to-stable label?
@Emerentius what do you mean by "only for 2 releases"? Isn't the slowdown present from stable through nightly? Reading the assembly from godbolt, it seems to me like there have been some changes, but the code causing the slowdown presented in this issue is still there. If that is the case, then it would need that label; but if it was fixed in the current stable, then we can close the ticket, as I do not believe we cut point releases for previous stable releases.
@estebank I mean it was fast in 1.23 stable and slow from 1.25 stable on, so fast for 2 stable releases, then regressed.
triage: P-medium, since I don't think investigating this takes priority over other things currently being juggled. But I'm also self-assigning, since I'm curious and I think I can do some bisection as a background task, assuming I can reproduce the perf regression locally.
discussed in T-compiler meeting. Re-prioritizing as P-high just for evaluating the scope of the regression (though some good work on that is already evident in this comment thread).
Not sure why I previously removed I-slow; I'm putting it back now.
Hey @Emerentius, what is the distinction you are making in the columns with the prefix "old_" and "new_" in the tables you present?
That is, I would have thought that "old_" somehow corresponds to the state prior to the changes from PR #45333 and "new_" corresponds to after PR #45333 landed; but you already established that PR #45333 landed in between the two versions that each have "old_" and "new_" labels applied to them.
So I don't understand what difference has been applied to either the code or the benchmark in between the two columns there. (Hypothesis: Is "old_" testing the binary search implementation prior to PR #45333, and "new_" testing the binary search implementation that was added by PR #45333 ?)
(In any case I definitely am able to reproduce a regression on my Linux desktop on benches/slice between nightly-2018-02-09 and 2018-02-12, as anticipated/noted by #53823 (comment) ; there's a fair amount of noise between runs when looking at binary_search_l3 in my own machine, but binary_search_l1 is much more consistent in illustrating a 2x regression for me.)
Okay so I've done some preliminary analysis of the scope of the problem.
On my desktop linux machine, I built the benchmark suite in libcore/benches atop the nightly compilers from 2018-02-09 and 2018-02-12. Then I did three runs of the suite on each compiler (for a total of six runs).
Update: On the LLVM ticket associated with this issue that was subsequently filed, someone noted that I didn't indicate the processor model where I gathered this data. Here that is, in a details block.
% cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping : 3
microcode : 0x25
cpu MHz : 1728.267
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips : 7183.27
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
[entries for processors 1 through 7 omitted: same model name, flags, cache, and bugs as processor 0; only the processor number, core id, apicid, and current MHz differ]
Here is the raw data from that exercise: https://gist.github.com/8f511af7012c750928a6bdbf8efa3e6f
Then I put it into a spreadsheet, took the min (of the three runs) of the reported ns/iter for each benchmark, and asked to see the ratio (i.e. slowdown, where 2.0 is a 2x slowdown, 1.0 is no change, and 0.5 is a 2x speedup), comparing the min for 2018-02-09 to 2018-02-12.
- (One might argue I should use the average instead of the minimum... but for understanding compiler output, I go along with the argument presented here that outliers are usually much slower than normal, rather than faster, so taking the minimum is more likely to filter out transient noise than to introduce an artificial improvement.)
Finally, when looking at the slowdown ratio, I asked for the values > 1.2 to be highlighted in red (so that we'd pay special attention to a 1.2x slowdown) and the values < 0.9 to be highlighted in green (so that we'd also notice reasonable speedups).
Here is that spreadsheet: https://www.dropbox.com/s/qznzchqhibi5o67/three-runs.xlsx?dl=0
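For clarity, the "slowdown" number is just the ratio of the per-benchmark minima; an illustrative sketch of the computation (not code that was actually run):

```rust
// Slowdown ratio as used in the table below: take the minimum of the repeated
// runs on each date, then divide new by old, so 2.0 means a 2x slowdown,
// 1.0 means no change, and 0.5 means a 2x speedup.
fn slowdown(runs_old_ns: &[f64], runs_new_ns: &[f64]) -> f64 {
    let min = |xs: &[f64]| xs.iter().copied().fold(f64::INFINITY, f64::min);
    min(runs_new_ns) / min(runs_old_ns)
}
```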
A number of benchmarks improved significantly (i.e. the slowdown ratio is <0.9) over the time period specified. I'm not going to list them here, but there are ~44 of them.
But here are the 9 cases that regressed:
| test | 2018-02-09 ns/iter (min of 3) | 2018-02-12 ns/iter (min of 3) | slowdown |
|---|---|---|---|
| char::methods::bench_to_digit_radix_10 | 16,208 | 20,078 | 1.24x |
| iter::bench_zip_then_skip | 2,580 | 3,899 | 1.51x |
| num::bench_u32_from_str_radix_2 | 17,896 | 23,223 | 1.30x |
| slice::binary_search_l1 | 21 | 51 | 2.43x |
| slice::binary_search_l1_with_dups | 21 | 46 | 2.19x |
| slice::binary_search_l2 | 31 | 67 | 2.16x |
| slice::binary_search_l2_with_dups | 31 | 67 | 2.16x |
| slice::binary_search_l3 | 108 | 141 | 1.31x |
| slice::binary_search_l3_with_dups | 109 | 142 | 1.30x |
Obviously the latter six cases are the ones that were already being pointed out by this bug.
(The other three entries in the table may or may not have the same core cause.)
Obviously the libcore/benches benchmark should not be the sole decider about whether a regression is significant, but based on these results, the impact does seem to be somewhat restricted in scope...
Also, there are some pretty interesting discussions among LLVM developers about when they turn cmov into branches. It sounds like there are a number of scenarios where this exhibits better performance, due to (I think) either 1. the processor's branch prediction, or 2. an expensive test condition that you'd rather evaluate just once.
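As an illustrative example (not from those discussions) of where the conversion can pay off: a condition that is almost always false predicts nearly perfectly, so a branch is close to free, whereas a branchless select keeps the compare on the loop's data-dependency chain every iteration. Binary search is the opposite extreme, since each comparison is essentially a coin flip.

```rust
// Illustrative only: a loop where a predictable branch can beat a branchless
// select, because the condition rarely changes outcome on typical inputs.
fn clamp_all(values: &mut [u64], max: u64) {
    for v in values.iter_mut() {
        // If out-of-range values are rare, the branch predictor handles this
        // almost perfectly; a cmov-based version would pay the compare/select
        // latency (and an unconditional store) on every iteration.
        if *v > max {
            *v = max;
        }
    }
}
```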
In particular, in the LLVM developer discussions linked above, I saw mention of the following flag: -x86-cmov-converter=false
I gave that a whirl, and the results seem promising:
% rustc +nightly-2018-02-12 -O -C llvm-args=-x86-cmov-converter=false --test lib.rs -o ../benches-2018-02-12.nocmovconv
% rustc +nightly-2018-02-12 -O --test lib.rs -o ../benches-2018-02-12
% rustc +nightly-2018-02-09 -O --test lib.rs -o ../benches-2018-02-09
% ../benches-2018-02-09 --bench slice
running 6 tests
test slice::binary_search_l1 ... bench: 21 ns/iter (+/- 0)
test slice::binary_search_l1_with_dups ... bench: 21 ns/iter (+/- 0)
test slice::binary_search_l2 ... bench: 31 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups ... bench: 31 ns/iter (+/- 0)
test slice::binary_search_l3 ... bench: 110 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups ... bench: 112 ns/iter (+/- 7)
test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured; 142 filtered out
% ../benches-2018-02-12 --bench slice
running 6 tests
test slice::binary_search_l1 ... bench: 51 ns/iter (+/- 0)
test slice::binary_search_l1_with_dups ... bench: 46 ns/iter (+/- 0)
test slice::binary_search_l2 ... bench: 67 ns/iter (+/- 1)
test slice::binary_search_l2_with_dups ... bench: 67 ns/iter (+/- 0)
test slice::binary_search_l3 ... bench: 143 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups ... bench: 142 ns/iter (+/- 0)
test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured; 142 filtered out
% ../benches-2018-02-12.nocmovconv --bench slice
running 6 tests
test slice::binary_search_l1 ... bench: 21 ns/iter (+/- 0)
test slice::binary_search_l1_with_dups ... bench: 21 ns/iter (+/- 0)
test slice::binary_search_l2 ... bench: 31 ns/iter (+/- 1)
test slice::binary_search_l2_with_dups ... bench: 31 ns/iter (+/- 0)
test slice::binary_search_l3 ... bench: 114 ns/iter (+/- 2)
test slice::binary_search_l3_with_dups ... bench: 115 ns/iter (+/- 1)
test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured; 142 filtered out
To be clear: I am not suggesting that we should turn the x86-cmov-converter pass off by default in rustc.
But:
- we should check whether the improved version from https://reviews.llvm.org/D36081 is in our LLVM (and if not, consider cherry-picking it over), and
- if we can reproduce this issue on the current LLVM version, then we should file an issue with them upstream.
Just checked, that patch is in our LLVM, and this still reproduces with current LLVM (running generated IR through llc).
I thought the issue here might be that cmov conversion does not realize that it will not be possible to sink the cmov chain into the created branches, because the condition also depends on them. However, I misunderstood what this pass is intended to do: it's not interested in sinking anything into the branches; the only point is to remove the dependency on the branch condition through speculation.
I've filed an LLVM issue: https://bugs.llvm.org/show_bug.cgi?id=40027
Okay, at this point I think I've convinced myself that the scope of this problem is somewhat limited, so I'm going to downgrade it to P-medium. It also isn't really going to be addressed on our end (except perhaps via cherry-picking a patch for the LLVM issue), so I'm tempted to close it, but I'll maybe wait to let other members of @rust-lang/compiler weigh in before I do that. I'll leave myself assigned for now, mostly as a reminder to close it in the future. If someone else wants to take this issue and run with it, feel free to assign yourself and unassign me.
I think the main actionable point here is to try to change the binary search implementation in a way that does not run into this LLVM issue. I'm not sure what that would look like, but something that interchanges the order of the compare and the load (i.e. already does the load for the next iteration) might work, because this is a pattern that LLVM specifically checks for.
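One possible reading of that suggestion, as a purely hypothetical and untested sketch (it has not been verified that LLVM actually keeps the cmov for this shape):

```rust
// Do the load for the *next* comparison right after the select, so the value
// picked by the select is immediately used as part of a load address, closer
// to the "pointer chasing" pattern the converter's heuristic exempts.
pub fn lower_bound_preload(slice: &[u64], key: u64) -> usize {
    let mut base = slice.as_ptr();
    let mut size = slice.len();
    if size <= 1 {
        return 0;
    }
    unsafe {
        // Pivot for the first comparison.
        let mut probe = *base.add(size / 2);
        loop {
            let half = size / 2;
            let mid = base.add(half);
            // `probe` was loaded at the end of the previous iteration.
            base = if probe > key { base } else { mid };
            size -= half;
            if size <= 1 {
                break;
            }
            // Already load the pivot for the next iteration.
            probe = *base.add(size / 2);
        }
        base.offset_from(slice.as_ptr()) as usize
    }
}
```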
Putting this on T-libs if they can find a libs-level solution.
If somebody knows a workaround, I'm interested.
Intersection queries in tantivy are roughly 10% slower because of this.
On nightly, one can force a cmov with inline asm!; for instance, in the case of binary search (though I personally cannot use nightly):
fn binary_search(arr: &[u32], mut base: usize, mut len: usize, target: u32) -> usize {
while len > 1 {
let half = len / 2;
let mid = base + half;
unsafe {
let pivot: u32 = *arr.get_unchecked(mid);
asm!("cmpl $2, $1\ncmovge $3, $0"
: "+r"(base)
: "r"(target), "r"(pivot), "r"(mid))
;
}
len -= half;
}
base + ((*unsafe { arr.get_unchecked(base) } < target) as usize)
}

@fulmicoton have you tried passing -C llvm-args=-x86-cmov-converter=false to rustc? I noted above that this seems to recover most of the lost performance on the relevant slice benchmarks.
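For reference: inline asm! has since been stabilized (Rust 1.59), so a similar workaround no longer requires nightly. A rough, untested sketch for x86_64 only, not a drop-in replacement:

```rust
// Force the select into a cmov with the stabilized asm! macro (x86_64 only).
// Semantics match the loop above: base = mid when pivot <= target.
fn lower_bound_asm(arr: &[u32], target: u32) -> usize {
    let mut base: usize = 0;
    let mut len = arr.len();
    if len == 0 {
        return 0;
    }
    while len > 1 {
        let half = len / 2;
        let mid = base + half;
        // SAFETY: mid < arr.len(), since base + len <= arr.len() and half < len.
        let pivot = unsafe { *arr.get_unchecked(mid) };
        unsafe {
            core::arch::asm!(
                "cmp {pivot:e}, {target:e}",
                "cmovbe {base}, {mid}", // base = mid if pivot <= target (unsigned)
                base = inout(reg) base,
                mid = in(reg) mid,
                pivot = in(reg) pivot,
                target = in(reg) target,
                options(nomem, nostack),
            );
        }
        len -= half;
    }
    // SAFETY: base < arr.len(), since base + len <= arr.len() and len >= 1.
    base + (unsafe { *arr.get_unchecked(base) } < target) as usize
}
```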
Unassigning self: I don't think there's much for me to do here from the viewpoint of rustc compiler, apart from hypothetically turning off x86-cmov-converter by default, which seems like a bad idea.
I'm not fluent in asm but it looks as though the problem still exists.
1.24:
.LBB0_4:
mov r9, rdx
shr r9
lea rax, [r9 + r8]
cmp qword ptr [rsi + 8*rax], rcx
cmova rax, r8
sub rdx, r9
cmp rdx, 1
mov r8, rax
ja .LBB0_4

1.55-nightly:
.LBB0_3:
shr rdx
add rdx, rsi
cmp qword ptr [rdi + 8*rdx], r8
jb .LBB0_6
je .LBB0_9
mov rcx, rdx
sub rdx, rsi
ja .LBB0_3

Note that #74024 significantly changed the binary_search implementation in the meantime, so the two are no longer directly comparable. Instead, you'll have to copy the old implementation into godbolt to retain comparability. That said, the issue still exists there too.
https://godbolt.org/z/8cqKdxG4K
pub fn binary_search(slice: &[u64], key: u64) -> Result<usize, usize> {
binary_search_by(slice, |val| val.cmp(&key))
}
use std::cmp::Ordering::{self, Greater, Less, Equal};
fn binary_search_by<'a, T, F>(slice: &'a [T], mut f: F) -> Result<usize, usize>
where F: FnMut(&'a T) -> Ordering
{
let s = slice;
let mut size = s.len();
if size == 0 {
return Err(0);
}
let mut base = 0usize;
while size > 1 {
let half = size / 2;
let mid = base + half;
// mid is always in [0, size).
// mid >= 0: by definition
// mid < size: mid = size / 2 + size / 4 + size / 8 ...
let cmp = f(unsafe { s.get_unchecked(mid) });
base = if cmp == Greater { base } else { mid };
size -= half;
}
// base is always in [0, size) because base <= mid.
let cmp = f(unsafe { s.get_unchecked(base) });
if cmp == Equal { Ok(base) } else { Err(base + (cmp == Less) as usize) }
}

> Putting this on T-libs if they can find a libs-level solution.
I have tested several older revisions of the binary_search and partition_point implementations. All of them had their conditional moves converted to cmp + branch due to this llvm change.
This LLVM issue comment suggests that the primary heuristic which turns off this optimization only looks for binary tree searches (unpredictable pointer chasing) but not binary searches (unpredictable pointer arithmetic).
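To make the distinction concrete, here is an illustrative sketch (not taken from LLVM or from this thread) of the "pointer chasing" shape the heuristic does recognize. In the binary_search_by code quoted above, the select instead picks an index, and the load address is then derived from it arithmetically (base + 8*mid in the assembly), which the heuristic misses.

```rust
// Binary *tree* search: the value chosen by the select is itself the pointer
// that the next iteration loads through, so the cmov is left alone.
struct Node {
    key: u64,
    left: *const Node,  // null when absent
    right: *const Node, // null when absent
}

// Safety: `node` must be null or point into a valid, properly linked tree.
unsafe fn tree_contains(mut node: *const Node, key: u64) -> bool {
    while !node.is_null() {
        let k = (*node).key;
        if k == key {
            return true;
        }
        // Pointer select feeding the next load: unpredictable pointer chasing.
        node = if key < k { (*node).left } else { (*node).right };
    }
    false
}
```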
Another heuristic related to cost modelling has been added. Modifying its threshold via -C llvm-args=-x86-cmov-converter-threshold=10 does bring the old binary search implementation (see previous comment) back to conditional moves. This doesn't work for the current, short-circuiting implementation from #74024 (though running that one on Rust 1.24 does produce cmovs).
I don't think there's anything we could do on the libs side to trigger that heuristic with its current default (4).
The options I currently see are:
- wait for an LLVM fix (might require a new unpredictable intrinsic)
- special-case x86 with assembly
- tweak llvm-args
Only option 2 would be a pure libs thing. Some of these options would still require changes to the current implementation and benchmark runs. The old while size > 1 approach seems to be easier to turn into cmovs than the current while left < right one. Or separate implementations of partition_point and binary_search may be needed.
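For reference, the post-#74024 loop has roughly the following short-circuiting shape (an approximation, not the exact libcore source):

```rust
// Rough shape of the current `while left < right` implementation: the early
// return on equality adds a branch inside the loop, and per the threshold
// experiment above this shape is not converted back to cmovs as easily as
// the old `while size > 1` select.
fn binary_search_shape(s: &[u64], key: u64) -> Result<usize, usize> {
    let mut left = 0usize;
    let mut right = s.len();
    while left < right {
        let mid = left + (right - left) / 2;
        let v = s[mid];
        if v < key {
            left = mid + 1;
        } else if v > key {
            right = mid;
        } else {
            return Ok(mid);
        }
    }
    Err(left)
}
```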
We discussed this issue in our last libs team meeting and concluded that there's nothing left for libs to do, and that we need to wait for the fix to happen upstream in https://bugs.llvm.org/show_bug.cgi?id=40027. Our plan is to hand this over to the compiler team for tracking as discussed in https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/tracking.20an.20old.20regression. For now we've tagged both teams and nominated it so that the compiler team can confirm the handover in their next meeting.
@yaahc @joshtriplett I'm happy to confirm the handover to T-compiler, but the prioritization of this (currently P-low; P-medium as of 8 days ago) essentially means that we're never going to prioritize it for discussion in a meeting. It will just be a task hanging out there for interested people to look at and discuss.
So, I just wanted to confirm: Is that an acceptable outcome from the point of view of you all? Or do you think this should be assigned higher priority?
@pnkfelix That's roughly the outcome we were expecting.
We do feel that this would be worthwhile to have on the radar of a team whose primary purview is LLVM-related fixes that would benefit Rust, but we also recognize that that's a "somebody should", and we don't have a somebody to offer.
What Josh said, I've gone ahead and removed the T-libs label, thanks for the confirmation @pnkfelix ^_^.
Could something like https://docs.rs/cmov/latest/cmov/ be used in the interim to force the usage of cmov to recover the performance?
Is profile-guided optimization maybe able to do anything in this case?
The x86 backend should now respect __builtin_unpredictable for cmov conversion, so if we can expose that in rustc in some way, we can use that.
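Purely as a hypothetical sketch (no such function or intrinsic exists in Rust at the time of writing): if rustc exposed an equivalent of clang's __builtin_unpredictable, lowering to LLVM's !unpredictable metadata on the select, the libs-level change could be as small as wrapping the comparison in the search loop.

```rust
// Hypothetical placeholder, not a real API: a real version would attach the
// `!unpredictable` metadata that the x86 cmov-converter now respects.
#[inline(always)]
fn unpredictable(b: bool) -> bool {
    b
}

// Usage inside the search loop would then look like:
//     base = if unpredictable(probe > key) { base } else { mid };
```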
There's a discussion on IRLO about what to do with likely()/unlikely(). Adding an unpredictable() was mentioned there too.