Binary search performance regressed in 1.25
Emerentius opened this issue · 36 comments
The speedup obtained by the optimizations from #45333, which hit stable in 1.23, was mostly lost again from 1.25 onwards, because the compiler started emitting a branch instead of a cmov.
1.24:
cmp qword ptr [rsi + 8*rax], rcx
cmova rax, r8

1.25:
cmp qword ptr [rsi + 8*r9], rcx
ja .LBB0_7
mov r8, r9
.LBB0_7:

From godbolt: https://godbolt.org/z/DUDl-T
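For context, the hot loop under discussion boils down to a data-dependent select of roughly the following shape (a minimal sketch for godbolt experiments, not the exact libcore source, which is quoted further down the thread):

```rust
// Minimal sketch of the regressed pattern: the `if`/`else` below is a
// branch-free select that LLVM 4 lowered to `cmova`, while LLVM 6's
// x86-cmov-converter pass turns it back into a compare-and-branch.
pub fn lower_bound(slice: &[u64], key: u64) -> usize {
    let mut base = 0usize;
    let mut size = slice.len();
    while size > 1 {
        let half = size / 2;
        let mid = base + half;
        // SAFETY: mid < slice.len(), since base + size <= slice.len() and half < size.
        let probe = unsafe { *slice.get_unchecked(mid) };
        base = if probe > key { base } else { mid };
        size -= half;
    }
    base
}
```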
Bench comparison with 1.23 nightly (2017-11-20, 9 days after PR merge)
| name | old_1.23 ns/iter | new_1.23 ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| binary_search_l1 | 70 | 28 | -42 | -60.00% | x 2.50 |
| binary_search_l1_with_dups | 46 | 28 | -18 | -39.13% | x 1.64 |
| binary_search_l2 | 94 | 42 | -52 | -55.32% | x 2.24 |
| binary_search_l2_with_dups | 68 | 42 | -26 | -38.24% | x 1.62 |
| binary_search_l3 | 267 | 234 | -33 | -12.36% | x 1.14 |
| binary_search_l3_with_dups | 198 | 236 | 38 | 19.19% | x 0.84 |
and status quo with 1.30 nightly (2018-08-15)
| name | old_1.30 ns/iter | new_1.30 ns/iter | diff ns/iter | diff % | speedup |
|---|---|---|---|---|---|
| binary_search_l1 | 75 | 68 | -7 | -9.33% | x 1.10 |
| binary_search_l1_with_dups | 46 | 61 | 15 | 32.61% | x 0.75 |
| binary_search_l2 | 99 | 89 | -10 | -10.10% | x 1.11 |
| binary_search_l2_with_dups | 71 | 89 | 18 | 25.35% | x 0.80 |
| binary_search_l3 | 272 | 245 | -27 | -9.93% | x 1.11 |
| binary_search_l3_with_dups | 212 | 256 | 44 | 20.75% | x 0.83 |
The likely culprit is the upgrade from LLVM 4 to 6. The upgrade was merged on February 10th; the binary search was still fast in the nightly from the 9th and slow in the one from the 12th.
@estebank The optimization did hit stable (if only for 2 releases), so does it not need the regression-from-stable-to-stable label?
@Emerentius what do you mean by "only for 2 releases"? Isn't the slowdown present from stable through nightly? Reading the assembly from godbolt, it seems to me like there have been some changes, but the code causing the slowdown presented in this issue is still there. If that is the case, then it would need that label; but if it was fixed in the current stable, then we can close the ticket, as I do not believe we cut point releases for previous stable releases.
@estebank I mean it was fast in 1.23 stable and slow from 1.25 stable on, so fast for 2 stable releases, then regressed.
triage: P-medium, since I don't think investigating this takes priority over other things currently being juggled. But I'm also self-assigning, since I'm curious and I think I can do some bisection as a background task, assuming I can reproduce the perf regression locally.
discussed in T-compiler meeting. Re-prioritizing as P-high just for evaluating the scope of the regression (though some good work on that is already evident in this comment thread).
Not sure why I previously removed I-slow; I'm putting it back now.
Hey @Emerentius, what is the distinction you are making in the columns with the prefix "old_" and "new_" in the tables you present?
That is, I would have thought that "old_" somehow corresponds to the state prior to the changes from PR #45333 and "new_" corresponds to after PR #45333 landed; but you already established that PR #45333 landed in between the two versions that each have "old_" and "new_" labels applied to them.
So I don't understand what difference has been applied to either the code or the benchmark in between the two columns there. (Hypothesis: Is "old_" testing the binary search implementation prior to PR #45333, and "new_" testing the binary search implementation that was added by PR #45333 ?)
(In any case I definitely am able to reproduce a regression on my Linux desktop on benches/slice between nightly-2018-02-09 and 2018-02-12, as anticipated/noted by #53823 (comment) ; there's a fair amount of noise between runs when looking at binary_search_l3 in my own machine, but binary_search_l1 is much more consistent in illustrating a 2x regression for me.)
Okay so I've done some preliminary analysis of the scope of the problem.
On my desktop linux machine, I built the benchmark suite in libcore/benches atop the nightly compilers from 2018-02-09 and 2018-02-12. Then I did three runs of the suite on each compiler (for a total of six runs).
Update: On the LLVM ticket associated with this issue that was subsequently filed, someone noted that I didn't indicate the processor model where I gathered this data. Here that is, in a details block.
% cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping : 3
microcode : 0x25
cpu MHz : 1728.267
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips : 7183.27
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
[entries for processors 1 through 7 omitted: same model name, flags, cache, and bugs as processor 0; only the processor number, core id, apicid, and current MHz differ]
Here is the raw data from that exercise: https://gist.github.com/8f511af7012c750928a6bdbf8efa3e6f
Then I put it into a spreadsheet, took the min (of the three runs) of the reported ns/iter for each benchmark, and asked to see the ratio (i.e. slowdown, where 2.0 is a 2x slowdown, 1.0 is no change, and 0.5 is a 2x speedup), comparing the min for 2018-02-09 to 2018-02-12.
- (One might argue I should use the average instead of the minimum... but for understanding compiler output, I go along with the argument presented here that outliers are usually much slower than normal, rather than faster, so taking the minimum is more likely to filter out transient noise than to introduce an artificial improvement.)
Finally, when looking at the slowdown ratio, I asked for the values > 1.2 to be highlighted in red (so that we'd pay special attention to a 1.2x slowdown) and the values < 0.9 to be highlighted in green (so that we'd also notice reasonable speedups).
Here is that spreadsheet: https://www.dropbox.com/s/qznzchqhibi5o67/three-runs.xlsx?dl=0
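For clarity, the "slowdown" number is just the ratio of the per-benchmark minima; an illustrative sketch of the computation (not code that was actually run):

```rust
// Slowdown ratio as used in the table below: take the minimum of the repeated
// runs on each date, then divide new by old, so 2.0 means a 2x slowdown,
// 1.0 means no change, and 0.5 means a 2x speedup.
fn slowdown(runs_old_ns: &[f64], runs_new_ns: &[f64]) -> f64 {
    let min = |xs: &[f64]| xs.iter().copied().fold(f64::INFINITY, f64::min);
    min(runs_new_ns) / min(runs_old_ns)
}
```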
A number of benchmarks improved significantly (i.e. the slowdown ratio is <0.9) over the time period specified. I'm not going to list them here, but there are ~44 of them.
But here are the 9 cases that regressed:
| test | 2018-02-09 ns/iter (min of 3) | 2018-02-12 ns/iter (min of 3) | slowdown |
|---|---|---|---|
| char::methods::bench_to_digit_radix_10 | 16,208 | 20,078 | 1.24x |
| iter::bench_zip_then_skip | 2,580 | 3,899 | 1.51x |
| num::bench_u32_from_str_radix_2 | 17,896 | 23,223 | 1.30x |
| slice::binary_search_l1 | 21 | 51 | 2.43x |
| slice::binary_search_l1_with_dups | 21 | 46 | 2.19x |
| slice::binary_search_l2 | 31 | 67 | 2.16x |
| slice::binary_search_l2_with_dups | 31 | 67 | 2.16x |
| slice::binary_search_l3 | 108 | 141 | 1.31x |
| slice::binary_search_l3_with_dups | 109 | 142 | 1.30x |
Obviously the latter six cases are the ones that were already being pointed out by this bug.
(The other three entries in the table may or may not have the same core cause.)
Obviously the libcore/benches benchmark should not be the sole decider about whether a regression is significant, but based on these results, the impact does seem to be somewhat restricted in scope...
Also, there are some pretty interesting discussions among LLVM developers about when they turn cmov into branches. It sounds like there are a number of scenarios where this exhibits better performance, due to (I think) either 1. the processor's branch prediction, or 2. an expensive test condition that you'd rather evaluate just once.
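As an illustrative example (not from those discussions) of where the conversion can pay off: a condition that is almost always false predicts nearly perfectly, so a branch is close to free, whereas a branchless select keeps the compare on the loop's data-dependency chain every iteration. Binary search is the opposite extreme, since each comparison is essentially a coin flip.

```rust
// Illustrative only: a loop where a predictable branch can beat a branchless
// select, because the condition rarely changes outcome on typical inputs.
fn clamp_all(values: &mut [u64], max: u64) {
    for v in values.iter_mut() {
        // If out-of-range values are rare, the branch predictor handles this
        // almost perfectly; a cmov-based version would pay the compare/select
        // latency (and an unconditional store) on every iteration.
        if *v > max {
            *v = max;
        }
    }
}
```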
In particular, in the LLVM developer discussions linked above, I saw mention of the following flag: -x86-cmov-converter=false
I gave that a whirl, and the results seem promising:
% rustc +nightly-2018-02-12 -O -C llvm-args=-x86-cmov-converter=false --test lib.rs -o ../benches-2018-02-12.nocmovconv
% rustc +nightly-2018-02-12 -O --test lib.rs -o ../benches-2018-02-12
% rustc +nightly-2018-02-09 -O --test lib.rs -o ../benches-2018-02-09
% ../benches-2018-02-09 --bench slice
running 6 tests
test slice::binary_search_l1 ... bench: 21 ns/iter (+/- 0)
test slice::binary_search_l1_with_dups ... bench: 21 ns/iter (+/- 0)
test slice::binary_search_l2 ... bench: 31 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups ... bench: 31 ns/iter (+/- 0)
test slice::binary_search_l3 ... bench: 110 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups ... bench: 112 ns/iter (+/- 7)
test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured; 142 filtered out
% ../benches-2018-02-12 --bench slice
running 6 tests
test slice::binary_search_l1 ... bench: 51 ns/iter (+/- 0)
test slice::binary_search_l1_with_dups ... bench: 46 ns/iter (+/- 0)
test slice::binary_search_l2 ... bench: 67 ns/iter (+/- 1)
test slice::binary_search_l2_with_dups ... bench: 67 ns/iter (+/- 0)
test slice::binary_search_l3 ... bench: 143 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups ... bench: 142 ns/iter (+/- 0)
test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured; 142 filtered out
% ../benches-2018-02-12.nocmovconv --bench slice
running 6 tests
test slice::binary_search_l1 ... bench: 21 ns/iter (+/- 0)
test slice::binary_search_l1_with_dups ... bench: 21 ns/iter (+/- 0)
test slice::binary_search_l2 ... bench: 31 ns/iter (+/- 1)
test slice::binary_search_l2_with_dups ... bench: 31 ns/iter (+/- 0)
test slice::binary_search_l3 ... bench: 114 ns/iter (+/- 2)
test slice::binary_search_l3_with_dups ... bench: 115 ns/iter (+/- 1)
test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured; 142 filtered out
To be clear: I am not suggesting that we should turn the x86-cmov-converter pass off by default in rustc.
But:
- we should check whether the improved version from https://reviews.llvm.org/D36081 is in our LLVM (and if not, consider cherry-picking it over), and
- if we can reproduce this issue on the current LLVM version, then we should file an issue with them upstream.
Just checked, that patch is in our LLVM, and this still reproduces with current LLVM (running generated IR through llc).
I thought the issue here might be that cmov conversion does not realize that it will not be possible to sink the cmov chain into the created branches, because the condition also depends on them. However, I misunderstood what this pass is intended to do: it's not interested in sinking anything into the branches; the only point is to remove the dependency on the branch condition through speculation.
I've filed an LLVM issue: https://bugs.llvm.org/show_bug.cgi?id=40027
Okay, at this point I think I've convinced myself that the scope of this problem is somewhat limited, so I'm going to downgrade it to P-medium. It also isn't really going to be addressed on our end (except perhaps via cherry-picking a patch for the LLVM issue), so I'm tempted to close it, but I'll maybe wait to let other members of @rust-lang/compiler weigh in before I do that. I'll leave myself assigned for now, mostly as a reminder to close it in the future. If someone else wants to take this issue and run with it, feel free to assign yourself and unassign me.
I think the main actionable point here is to try to change the binary search implementation in a way that does not run into this LLVM issue. I'm not sure what that would look like, but something that interchanges the order of the compare and the load (i.e. already does the load for the next iteration) might work, because this is a pattern that LLVM specifically checks for.
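One possible reading of that suggestion, as a purely hypothetical and untested sketch (it has not been verified that LLVM actually keeps the cmov for this shape):

```rust
// Do the load for the *next* comparison right after the select, so the value
// picked by the select is immediately used as part of a load address, closer
// to the "pointer chasing" pattern the converter's heuristic exempts.
pub fn lower_bound_preload(slice: &[u64], key: u64) -> usize {
    let mut base = slice.as_ptr();
    let mut size = slice.len();
    if size <= 1 {
        return 0;
    }
    unsafe {
        // Pivot for the first comparison.
        let mut probe = *base.add(size / 2);
        loop {
            let half = size / 2;
            let mid = base.add(half);
            // `probe` was loaded at the end of the previous iteration.
            base = if probe > key { base } else { mid };
            size -= half;
            if size <= 1 {
                break;
            }
            // Already load the pivot for the next iteration.
            probe = *base.add(size / 2);
        }
        base.offset_from(slice.as_ptr()) as usize
    }
}
```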
Putting this on T-libs if they can find a libs-level solution.
If somebody knows a workaround, I'm interested.
Intersection queries in tantivy are roughly 10% slower because of this.
On nightly, one can force a cmov with inline asm!; for instance, in the case of binary search (though I personally cannot use nightly):
fn binary_search(arr: &[u32], mut base: usize, mut len: usize, target: u32) -> usize {
while len > 1 {
let half = len / 2;
let mid = base + half;
unsafe {
let pivot: u32 = *arr.get_unchecked(mid);
asm!("cmpl $2, $1\ncmovge $3, $0"
: "+r"(base)
: "r"(target), "r"(pivot), "r"(mid))
;
}
len -= half;
}
base + ((*unsafe { arr.get_unchecked(base) } < target) as usize)
}

@fulmicoton have you tried passing -C llvm-args=-x86-cmov-converter=false to rustc? I noted above that this seems to recover most of the lost performance on the relevant slice benchmarks.
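For reference: inline asm! has since been stabilized (Rust 1.59), so a similar workaround no longer requires nightly. A rough, untested sketch for x86_64 only, not a drop-in replacement:

```rust
// Force the select into a cmov with the stabilized asm! macro (x86_64 only).
// Semantics match the loop above: base = mid when pivot <= target.
fn lower_bound_asm(arr: &[u32], target: u32) -> usize {
    let mut base: usize = 0;
    let mut len = arr.len();
    if len == 0 {
        return 0;
    }
    while len > 1 {
        let half = len / 2;
        let mid = base + half;
        // SAFETY: mid < arr.len(), since base + len <= arr.len() and half < len.
        let pivot = unsafe { *arr.get_unchecked(mid) };
        unsafe {
            core::arch::asm!(
                "cmp {pivot:e}, {target:e}",
                "cmovbe {base}, {mid}", // base = mid if pivot <= target (unsigned)
                base = inout(reg) base,
                mid = in(reg) mid,
                pivot = in(reg) pivot,
                target = in(reg) target,
                options(nomem, nostack),
            );
        }
        len -= half;
    }
    // SAFETY: base < arr.len(), since base + len <= arr.len() and len >= 1.
    base + (unsafe { *arr.get_unchecked(base) } < target) as usize
}
```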
Unassigning self: I don't think there's much for me to do here from the viewpoint of rustc compiler, apart from hypothetically turning off x86-cmov-converter by default, which seems like a bad idea.
I'm not fluent in asm but it looks as though the problem still exists.
1.24:
.LBB0_4:
mov r9, rdx
shr r9
lea rax, [r9 + r8]
cmp qword ptr [rsi + 8*rax], rcx
cmova rax, r8
sub rdx, r9
cmp rdx, 1
mov r8, rax
ja .LBB0_4

1.55-nightly:
.LBB0_3:
shr rdx
add rdx, rsi
cmp qword ptr [rdi + 8*rdx], r8
jb .LBB0_6
je .LBB0_9
mov rcx, rdx
sub rdx, rsi
ja .LBB0_3

Note that #74024 significantly changed the binary_search implementation in the meantime, so the two are no longer directly comparable. Instead, you'll have to copy the old implementation into godbolt to retain comparability. That said, the issue still exists there too.
https://godbolt.org/z/8cqKdxG4K
pub fn binary_search(slice: &[u64], key: u64) -> Result<usize, usize> {
binary_search_by(slice, |val| val.cmp(&key))
}
use std::cmp::Ordering::{self, Greater, Less, Equal};
fn binary_search_by<'a, T, F>(slice: &'a [T], mut f: F) -> Result<usize, usize>
where F: FnMut(&'a T) -> Ordering
{
let s = slice;
let mut size = s.len();
if size == 0 {
return Err(0);
}
let mut base = 0usize;
while size > 1 {
let half = size / 2;
let mid = base + half;
// mid is always in [0, size).
// mid >= 0: by definition
// mid < size: mid = size / 2 + size / 4 + size / 8 ...
let cmp = f(unsafe { s.get_unchecked(mid) });
base = if cmp == Greater { base } else { mid };
size -= half;
}
// base is always in [0, size) because base <= mid.
let cmp = f(unsafe { s.get_unchecked(base) });
if cmp == Equal { Ok(base) } else { Err(base + (cmp == Less) as usize) }
}

> Putting this on T-libs if they can find a libs-level solution.
I have tested several older revisions of the binary_search and partition_point implementations. All of them had their conditional moves converted to cmp + branch due to this llvm change.
This LLVM issue comment suggests that the primary heuristic which turns off this optimization only looks for binary tree searches (unpredictable pointer chasing) but not binary searches (unpredictable pointer arithmetic).
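To make the distinction concrete, here is an illustrative sketch (not taken from LLVM or from this thread) of the "pointer chasing" shape the heuristic does recognize. In the binary_search_by code quoted above, the select instead picks an index, and the load address is then derived from it arithmetically (base + 8*mid in the assembly), which the heuristic misses.

```rust
// Binary *tree* search: the value chosen by the select is itself the pointer
// that the next iteration loads through, so the cmov is left alone.
struct Node {
    key: u64,
    left: *const Node,  // null when absent
    right: *const Node, // null when absent
}

// Safety: `node` must be null or point into a valid, properly linked tree.
unsafe fn tree_contains(mut node: *const Node, key: u64) -> bool {
    while !node.is_null() {
        let k = (*node).key;
        if k == key {
            return true;
        }
        // Pointer select feeding the next load: unpredictable pointer chasing.
        node = if key < k { (*node).left } else { (*node).right };
    }
    false
}
```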
Another heuristic related to cost modelling has been added. Modifying its threshold via -C llvm-args=-x86-cmov-converter-threshold=10 does bring the old binary search implementation (see previous comment) back to conditional moves. This doesn't work for the current, short-circuiting implementation from #74024 (though running that one on Rust 1.24 does produce cmovs).
I don't think there's anything we could do on the libs side to trigger that heuristic with its current default (4).
The options I currently see are:
- wait for an LLVM fix (might require a new unpredictable intrinsic)
- special-case x86 with assembly
- tweak llvm-args
Only option 2 would be a pure libs thing. Some of these options would still require changes to the current implementation and benchmark runs. The old while size > 1 approach seems to be easier to turn into cmovs than the current while left < right one. Or separate implementations of partition_point and binary_search may be needed.
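For reference, the post-#74024 loop has roughly the following short-circuiting shape (an approximation, not the exact libcore source):

```rust
// Rough shape of the current `while left < right` implementation: the early
// return on equality adds a branch inside the loop, and per the threshold
// experiment above this shape is not converted back to cmovs as easily as
// the old `while size > 1` select.
fn binary_search_shape(s: &[u64], key: u64) -> Result<usize, usize> {
    let mut left = 0usize;
    let mut right = s.len();
    while left < right {
        let mid = left + (right - left) / 2;
        let v = s[mid];
        if v < key {
            left = mid + 1;
        } else if v > key {
            right = mid;
        } else {
            return Ok(mid);
        }
    }
    Err(left)
}
```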
We discussed this issue in our last libs team meeting and concluded that there's nothing left for libs to do, and that we need to wait for the fix to happen upstream in https://bugs.llvm.org/show_bug.cgi?id=40027. Our plan is to hand this over to the compiler team for tracking as discussed in https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/tracking.20an.20old.20regression. For now we've tagged both teams and nominated it so that the compiler team can confirm the handover in their next meeting.
@yaahc @joshtriplett I'm happy to confirm the handover to T-compiler, but the prioritization of this (currently P-low; P-medium as of 8 days ago) essentially means that we're never going to prioritize it for discussion in a meeting. It will just be a task hanging out there for interested people to look at and discuss.
So, I just wanted to confirm: Is that an acceptable outcome from the point of view of you all? Or do you think this should be assigned higher priority?
@pnkfelix That's roughly the outcome we were expecting.
We do feel that this would be worthwhile to have on the radar of a team whose primary purview is LLVM-related fixes that would benefit Rust, but we also recognize that that's a "somebody should", and we don't have a somebody to offer.
What Josh said, I've gone ahead and removed the T-libs label, thanks for the confirmation @pnkfelix ^_^.
Could something like https://docs.rs/cmov/latest/cmov/ be used in the interim to force the usage of cmov to recover the performance?
Is profile-guided optimization maybe able to do anything in this case?
The x86 backend should now respect __builtin_unpredictable for cmov conversion, so if we can expose that in rustc in some way, we can use that.
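Purely as a hypothetical sketch (no such function or intrinsic exists in Rust at the time of writing): if rustc exposed an equivalent of clang's __builtin_unpredictable, lowering to LLVM's !unpredictable metadata on the select, the libs-level change could be as small as wrapping the comparison in the search loop.

```rust
// Hypothetical placeholder, not a real API: a real version would attach the
// `!unpredictable` metadata that the x86 cmov-converter now respects.
#[inline(always)]
fn unpredictable(b: bool) -> bool {
    b
}

// Usage inside the search loop would then look like:
//     base = if unpredictable(probe > key) { base } else { mid };
```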
There's a discussion on IRLO about what to do with likely()/unlikely(). Adding an unpredictable() was mentioned there too.