Just introducing an intermediate variable speeds up code by ~4%
I tried this code:
https://www.godbolt.org/z/exEqG7r1n
use std::ops::Index;

struct AssociationList<K, V> {
    pairs: Vec<AssociationPair<K, V>>,
}

#[derive(Clone)]
struct AssociationPair<K, V> {
    key: K,
    value: V,
}

impl<K, V> AssociationList<K, V> {
    fn push(&mut self, key: K, value: V) {
        self.pairs.push(AssociationPair {
            key: key,
            value: value,
        });
    }
}

impl<'a, K: PartialEq + std::fmt::Debug, V: Clone> Index<&'a K> for AssociationList<K, V> {
    type Output = V;

    fn index(&self, index: &K) -> &V {
        for pair in &self.pairs {
            if pair.key == *index {
                return &pair.value;
            }
        }
        panic!("No value found for key: {:?}", index);
    }
}

pub fn main() {
    let foo = std::rc::Rc::new("foo".to_string());
    let bar = "bar".to_string();
    let mut list = AssociationList { pairs: Vec::new() };
    list.push((*foo).clone(), 22);
    list.push(bar.clone(), 44);
    assert_eq!(list[&(*foo)], 22);
    assert_eq!(list[&bar], 44);
    assert_eq!(list[&(*foo)], 22);
    assert_eq!(list[&bar], 44);
}
and:
use std::ops::Index;

struct AssociationList<K, V> {
    pairs: Vec<AssociationPair<K, V>>,
}

#[derive(Clone)]
struct AssociationPair<K, V> {
    key: K,
    value: V,
}

impl<K, V> AssociationList<K, V> {
    fn push(&mut self, key: K, value: V) {
        self.pairs.push(AssociationPair {
            key: key,
            value: value,
        });
    }
}

impl<'a, K: PartialEq + std::fmt::Debug, V: Clone> Index<&'a K> for AssociationList<K, V> {
    type Output = V;

    fn index(&self, index: &K) -> &V {
        for pair in &self.pairs {
            if pair.key == *index {
                return &pair.value;
            }
        }
        panic!("No value found for key: {:?}", index);
    }
}

pub fn main() {
    let foo = std::rc::Rc::new("foo".to_string());
    let bar = "bar".to_string();
    let mut list = AssociationList { pairs: Vec::new() };
    list.push((*foo).clone(), 22);
    let temp_bridge_4831 = bar;
    list.push(temp_bridge_4831.clone(), 44);
    assert_eq!(list[&(*foo)], 22);
    assert_eq!(list[&temp_bridge_4831], 44);
    assert_eq!(list[&(*foo)], 22);
    assert_eq!(list[&temp_bridge_4831], 44);
}
I expected to see this happen:
The two versions differ only in that the second introduces an additional binding (let temp_bridge_4831 = bar;). I expected the generated assembly and the runtime performance to remain essentially the same.
Instead, this happened:
When benchmarked with hyperfine (with the body of main wrapped in a loop of 10^7 iterations; a sketch of the loop shape appears below), the version with the extra binding was about 4.4% faster.
Although the difference is not large, intermediate variables and ownership transfers are extremely common in Rust, so one would not expect such a small change to cause a noticeable difference in a performance-sensitive scenario.
Could introducing the intermediate variable trigger some unexpected optimization, perhaps related to the Index trait implementation, the way indexing works, or some other subtle effect?
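For reference, the benchmarked shape was roughly the following (a sketch, not the exact harness): the body of main from the listings above repeated 10^7 times, reusing the AssociationList type, with std::hint::black_box keeping the otherwise-dead loop body from being optimized away.

use std::hint::black_box;

fn main() {
    // Body of the original main, repeated 10^7 times so hyperfine has
    // something measurable; `black_box` keeps the result observable.
    for _ in 0..10_000_000u32 {
        let foo = std::rc::Rc::new("foo".to_string());
        let bar = "bar".to_string();
        let mut list = AssociationList { pairs: Vec::new() };
        list.push((*foo).clone(), 22);
        list.push(bar.clone(), 44);
        assert_eq!(list[&(*foo)], 22);
        assert_eq!(list[&bar], 44);
        black_box(&list);
    }
}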
Meta
rustc --version --verbose:
rustc 1.92.0-nightly (9f32ccf35 2025-09-21)
binary: rustc
commit-hash: 9f32ccf35fb877270bc44a86a126440f04d676d0
commit-date: 2025-09-21
host: x86_64-unknown-linux-gnu
release: 1.92.0-nightly
LLVM version: 21.1.1
hyperfine.json (no extra binding)
{
  "results": [
    {
      "command": "./target/release/overloaded-index-assoc-list",
      "mean": 1.0572985350399997,
      "stddev": 0.004401112717279841,
      "median": 1.05744619174,
      "user": 1.0555616399999999,
      "system": 0.001539378,
      "min": 1.05054995624,
      "max": 1.06405305924,
      "times": [
        1.0512718642399999, 1.0587477022399998, 1.05715102424,
        1.0553098372399998, 1.05774135924, 1.06405305924,
        1.0553158062399999, 1.06256787324, 1.06027686824, 1.05054995624
      ],
      "exit_codes": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    }
  ]
}
hyperfine.json (with extra binding)
{
  "results": [
    {
      "command": "./target/release/overloaded-index-assoc-list_new",
      "mean": 1.01132581302,
      "stddev": 0.009323974797432985,
      "median": 1.01071172272,
      "user": 1.0098468400000002,
      "system": 0.0014089679999999998,
      "min": 0.9957634287200001,
      "max": 1.02754362272,
      "times": [
        1.01354617872, 1.02754362272, 1.00658507772,
        1.0096771927200001, 1.02160711572, 0.9957634287200001,
        1.01174625272, 1.00343215972, 1.00540717572, 1.01794992572
      ],
      "exit_codes": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    }
  ]
}
I had a look at the difference in the generated assembly code.
x86-64 has "16" general-purpose registers (really 15, since %rsp is reserved for the stack pointer), and the main function in both versions of the code uses all of them and still needs room to store data. Much of the pressure comes from function calls such as the call to String::clone: under the ABI, all but 6 of those registers are caller-saved, so any value that has to survive a call must either occupy one of the 6 callee-saved registers or be spilled. This means that some of the data can't be kept in registers and needs to be stored on the stack.
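As a toy illustration of this (hypothetical code, not taken from the program above): any value that is live across a call must either sit in a callee-saved register or be spilled to the stack and reloaded, because the callee is allowed to clobber all the caller-saved ones.

#[inline(never)]
fn opaque(x: u64) -> u64 {
    x.wrapping_mul(3)
}

pub fn live_across_call(a: u64, b: u64) -> u64 {
    // `v` is live across the call to `opaque`, so it must either occupy a
    // callee-saved register or be spilled around the call, since `opaque`
    // may clobber every caller-saved register.
    let v = a ^ b;
    let w = opaque(a);
    v.wrapping_add(w)
}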
Introducing temp_bridge_4831 doesn't change the semantics of the program, but I suspect it causes LLVM's register allocator to encounter the program's temporary values in a different order. Register allocation is necessarily approximate (doing it perfectly is NP-complete, and thus probably takes exponential time), so presenting the temporaries in a different order can lead the algorithm to different decisions. Although I'm not completely certain, the difference in the assembly for the two versions looks very much like the result of different register allocation decisions rather than of any higher-level property of the code.
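To make that concrete, here's a distilled hypothetical pair (this tiny example may well compile identically; it only illustrates the mechanism): both functions compute the same value, but the extra binding in the second can present the live values to the allocator in a different order, and with it change which values get spilled.

#[inline(never)]
fn consume(s: String) -> usize {
    s.len()
}

// Semantically identical functions; the only difference is the extra `let`,
// which may change the order in which LLVM's register allocator considers
// the temporaries that are live across the calls to `consume`.
pub fn without_bridge(a: String, b: String) -> usize {
    consume(a.clone()) + consume(b.clone()) + a.len() + b.len()
}

pub fn with_bridge(a: String, b: String) -> usize {
    let bridge = b; // extra binding, same semantics
    consume(a.clone()) + consume(bridge.clone()) + a.len() + bridge.len()
}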
In particular, it seems to have made different decisions about which values to keep in registers and which to store on the stack. I can easily imagine that causing a 4% performance difference if some of the values are accessed more often than others. It does, however, seem like a difficult sort of gain to obtain intentionally, because perfect register allocation algorithms are unusably slow. (The more likely way to get the same gain would be reasoning along the lines of "although according to the ABI, String::clone is allowed to clobber 9 general-purpose registers in addition to using %rsp, it actually uses fewer than that in practice, so I can keep some of my own values in them" and avoid the spills altogether. But I don't think LLVM is currently capable of that sort of interprocedural reasoning.)