rust-lang/rust

Missed Optimization: Inefficient Handling of Mutable Reference in Simple Conditional Assignment

paolobarbolini opened this issue · 5 comments

The following example shows how LLVM is unable to optimize away a trivial branch when working through a &mut reference, while it can when working with an owned value. I'd expect both functions to overwrite current_len with 0 without needing to branch.

#[no_mangle]
pub fn unoptimized(current_len: &mut u32) {
    if *current_len != 0 {
        *current_len = 0;
    }
}

#[no_mangle]
pub fn optimized(mut current_len: u32) -> u32 {
    if current_len != 0 {
        current_len = 0;
    }
    current_len
}

unoptimized:
        cmp     dword ptr [rdi], 0
        je      .LBB0_2
        mov     dword ptr [rdi], 0
.LBB0_2:
        ret

optimized:
        xor     eax, eax
        ret

define void @unoptimized(ptr noalias nocapture noundef align 4 dereferenceable(4) %current_len) unnamed_addr {
start:
  %_2 = load i32, ptr %current_len, align 4
  %0 = icmp eq i32 %_2, 0
  br i1 %0, label %bb3, label %bb1

bb1:
  store i32 0, ptr %current_len, align 4
  br label %bb3

bb3:
  ret void
}

define noundef i32 @optimized(i32 noundef %0) unnamed_addr {
start:
  ret i32 0
}

nikic commented

Whether this optimization is legal is still an open question, but the current answer seems to be leaning towards "no". See #114886 for some entry points to that discussion. (cc @RalfJung)

Cc @rust-lang/opsem

RalfJung commented

Yeah, under some of the models being considered for Rust references (in particular, Tree Borrows), if a function never writes through a mutable reference, then the compiler cannot make it write. This makes some unsafe code a lot easier to reason about, greatly reducing the risk of UB. But it also means we can't hoist the write out of the if in this example.

rust-lang/unsafe-code-guidelines#133 is probably the best UCG issue for this question.
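
For concreteness, "hoisting the write out of the if" would mean compiling unoptimized roughly as if it had been written like the sketch below (the name hoisted is illustrative and does not come from the thread):

#[no_mangle]
pub fn hoisted(current_len: &mut u32) {
    // Unconditional store: branch-free, but it writes even on executions
    // where the original function never wrote through current_len.
    *current_len = 0;
}

On x86-64 this compiles to roughly a single mov dword ptr [rdi], 0 followed by ret, which is the branch-free code the issue author expected; the question above is whether introducing that extra write is allowed.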

taralx commented

I'm not clear on what the optimizer is supposed to do here. Unconditionally write?

comex commented

An unconditional write is smaller and often faster, but it's not always faster. If the cache line is not already dirty, writing will dirty it, which requires the CPU to acquire the cache line in exclusive state (potentially slow if there's contention with other cores) and also requires the line to eventually be written back to main memory (consuming memory bandwidth).

This doesn't apply to the second example function, where the argument is passed and returned in a register.
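
As an aside (not from the thread), the same check-before-store pattern is sometimes written deliberately in performance-sensitive code for exactly this cache-line reason, for example with atomics shared between threads. A minimal sketch, with an illustrative name (reset_if_set) that is not from the issue:

use std::sync::atomic::{AtomicU32, Ordering};

// Checking before storing avoids dirtying the cache line when the value
// is already 0, at the cost of a branch; an unconditional store would
// always mark the line dirty and force an eventual write-back.
pub fn reset_if_set(flag: &AtomicU32) {
    if flag.load(Ordering::Relaxed) != 0 {
        flag.store(0, Ordering::Relaxed);
    }
}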