RootedRc is not much faster than Arc

Question

RootedRc is not much faster than Arc

Opened this issue 2 years ago · 6 comments

sporksmith commented 2 years ago

Well that's unfortunate, since the whole point is to be faster than Arc.

Prime suspects are:

the hash set used to track which roots are locked by the current thread. If so, could mitigate with a faster hash function (stdlib's default is known to be slow), or only supporting one root to be unlocked at a time so that it can just be an Option instead of a HashSet.
thread-local-access itself. I seem to vaguely remember seeing somewhere that stdlib's thread locals may be slower than native ones. If this is the issue we should be able to find or make an alternative that e.g. just uses libc's native thread locals (assuming there's not some fundamental safety issue preventing that)

Probable next step is to run perf to verify where the time is actually being spent.

Here are the current numbers:

$ cargo bench
...
RootedRc::clone 1000    time:   [28.890 us 29.105 us 29.332 us]                                  
                        change: [-1.7329% -1.1690% -0.5940%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

Arc::clone 1000         time:   [8.0026 us 8.0184 us 8.0347 us]                             
                        change: [-0.6287% -0.2630% +0.1302%] (p = 0.20 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

EDIT: deleted invalid Rc benchmark (operations under test were optimized away)

Answer 1 · 2022-05-14T17:59:57.000Z

Locally changing the HashSet to an Option does make it faster than Arc, but not by much:

RootedRc::clone 1000    time:   [5.1394 us 5.1490 us 5.1592 us]                                  
Arc::clone 1000         time:   [7.9881 us 8.0049 us 8.0236 us]

EDIT: deleted invalid Rc benchmark (operations under test were optimized away)

Answer 2 · 2022-05-14T18:14:33.000Z

Using the unstable non-lazily-initialized thread-locals in addition to using Option instead of HashSet does further improve performance, but still only a marginal improvement vs Arc:

RootedRc::clone 1000    time:   [3.8106 us 3.8202 us 3.8304 us]                                  
Arc::clone 1000         time:   [8.1084 us 8.1633 us 8.2170 us]

EDIT: pushed these changes to performance-hacks branch for reference. Above measurement was @ 5f409ef
EDIT: deleted invalid Rc benchmark (operations under test were optimized away)

Answer 3 · 2022-05-15T00:47:26.000Z

I iterated on the benchmarks - it looks like Rc::clone was getting optimized away, which is why it appears so ridiculously fast here. I also removed the internal 1000x loop from the benchmarks, since the criterion framework itself does its own looping and has support for moving setup and teardown outside of the benchmark loop.

Switching the hasher for hashset to a more efficient one helped quite a bit. Changing the design to not require a hashset at all helped a bit more. It's still only a little faster than Arc.

Arc is "only" ~10x slower than Rc as well. "Just use Arc" is looking better all the time, though maybe it's worth sketching out a RootedRefCell and comparing that to RefCell and Mutex as well.

Current results on main, at 3a194ff:

clone/RootedRc          time:   [8.4849 ns 8.5352 ns 8.5831 ns]                            
clone/Arc               time:   [10.518 ns 10.567 ns 10.616 ns]                      
clone/Rc                time:   [1.4322 ns 1.5442 ns 1.6443 ns]                      
drop/RootedRc           time:   [16.882 ns 16.960 ns 17.043 ns]                           
drop/Arc                time:   [22.843 ns 22.968 ns 23.110 ns]                     
drop/Rc                 time:   [13.845 ns 13.937 ns 14.036 ns]

Answer 4 · 2022-05-15T00:50:08.000Z

Updated the benchmarks on the "performance-hacks" branch as well. That branch still has the advantage of using the (unstable) faster thread-locals. It helps a bit, but not that much:

@ 659e813

clone/RootedRc          time:   [6.8440 ns 6.8960 ns 6.9478 ns]                            
drop/RootedRc           time:   [15.282 ns 15.351 ns 15.424 ns]

Answer 5 · 2022-05-15T15:50:40.000Z

Added a "cross-core clone" benchmark. This creates 1000 Arc or RootedRc on one core, and then clones them from every other CPU core (using CPU pinning). (The 1000x is needed to amortize the cost of setting up the thread, affinity, etc; it'd be complex to use Criterion to separate those costs). This should make the atomic operations more expensive.

This seems to show slightly more benefit for RootedRd, though the amount of benefit beyond the single-core case is close to being in the noise. Ultimately the total performance benefit vs Arc still seems likely to be marginal in realistic usage.

On the "performance hacks" branch @ e74aabe:

clone/RootedRc          time:   [6.7978 ns 6.8425 ns 6.8857 ns]                            
clone/Arc               time:   [10.508 ns 10.557 ns 10.606 ns]                      
clone/Rc                time:   [1.1069 ns 1.1528 ns 1.2005 ns]   
                   
cross-core clone/RootedRc time:   [3.3485 ms 3.4909 ms 3.6510 ms]
cross-core clone/Arc    time:   [5.4642 ms 5.7656 ms 6.0682 ms]

EDIT: There's no cross-core Rc test because these tests use separate threads to create the objects in setup vs doing the actual work, and Rc's can't be sent across threads.

Answer 6 · 2022-08-12T21:02:34.000Z

I added a RootedRc::fast_clone that takes a reference to the lock, and so doesn't have to access the thread-local.

I also found and fixed a problem in all of the clone benchmarks - the cloned value was getting dropped inside the benchmark timing. This didn't matter too much before fast_clone, but since the drop function still has to access the thread local, it made fast_clone not look like as much of an improvement.

RootedRc::fast_clone is about as fast as Rc::clone, which is as expected. The same principle should work for a "fast drop" method.

clone/RootedRc                time:   [4.0448 ns 4.1450 ns 4.2579 ns]                           
clone/RootedRc fast_clone     time:   [1.8249 ns 1.8565 ns 1.8865 ns]                                                                       
clone/Arc                     time:   [4.5357 ns 4.5478 ns 4.5600 ns]
clone/Rc                      time:   [1.2237 ns 1.2721 ns 1.3381 ns]

Or normalizing the median times by Rc::clone and sorting:

Rc::clone:            1.0
RootedRc::fast_clone: 1.5
RootedRc::clone:      3.3
Arc::clone:           3.6

After implementing a "fast drop", this issue will probably be ready to be closed.