sparsemat/sprs

`MulAcc` causes significant speed regression

tomtung opened this issue · 9 comments

I just found out that my implementation of an ML model has been training up to several times slower than before. After some debugging, turns out that the cause was #276, where the MulAcc trait was introduced in d56bd72, and specifically where the Copy bound was removed in d2f0da7. For details please see tomtung/omikuji#32.

My intuitive guess is that the trait MulAcc got rid of the Copy bound at the cost of forcing the values to be passed as references (pointers). This might have made it hard for the compiler to do inlining and/or other optimizations. since the += operation is typically in the innermost part of the loop, any slow-down in such hot spots would result in significant speed regression.

I don't have context on the introduction of MulAcc, so I'm not sure what's the right way to fix this. Let me know if there's something I could help with.

I tried confirming the speed regression with a quick & scrappy benchmark.

I first cloned sprs locally, and patch it to add back the previous csr_mulacc_dense_colmaj implementation with a suffix _old:

--- a/src/sparse/prod.rs
+++ b/src/sparse/prod.rs
@@ -295,6 +295,46 @@ pub fn csc_mulacc_dense_colmaj<'a, N, A, B, I, Iptr>(
     }
 }

+/// CSR-dense colmaj multiplication
+///
+/// Performs better if rhs has few columns.
+pub fn csr_mulacc_dense_colmaj_old<'a, N1, N2, NOut, I, Iptr>(
+    lhs: CsMatViewI<N1, I, Iptr>,
+    rhs: ArrayView<N2, Ix2>,
+    mut out: ArrayViewMut<'a, NOut, Ix2>,
+) where
+    N1: 'a + Num + Copy,
+    N2: 'a + Num + Copy,
+    NOut: 'a + Num + Copy,
+    N1: std::ops::Mul<N2, Output = NOut>,
+    I: 'a + SpIndex,
+    Iptr: 'a + SpIndex,
+{
+    if lhs.cols() != rhs.shape()[0] {
+        panic!("Dimension mismatch");
+    }
+    if lhs.rows() != out.shape()[0] {
+        panic!("Dimension mismatch");
+    }
+    if rhs.shape()[1] != out.shape()[1] {
+        panic!("Dimension mismatch");
+    }
+    if !lhs.is_csr() {
+        panic!("Storage mismatch");
+    }
+    let axis1 = Axis(1);
+    for (mut ocol, rcol) in out.axis_iter_mut(axis1).zip(rhs.axis_iter(axis1)) {
+        for (orow, lrow) in lhs.outer_iterator().enumerate() {
+            let mut prev = ocol[[orow]];
+            for (rrow, &lval) in lrow.iter() {
+                let rval = rcol[[rrow]];
+                prev = prev + lval * rval;
+            }
+            ocol[[orow]] = prev;
+        }
+    }
+}
+
 /// CSR-dense colmaj multiplication
 ///
 /// Performs better if rhs has few columns.

then I modified Cargo.toml a bit

--- a/Cargo.toml
+++ b/Cargo.toml
@@ -29,6 +29,7 @@ smallvec = "1.4.0"
 rayon = { version = "1.3.0", optional = true }
 num_cpus = { version = "1.13.0", optional = true }
 approx = { version = "0.5", optional = true }
+rand = { version = "0.8", default-features = false, features = ["small_rng"] }

 [dev-dependencies]
 num-derive = "0.3"
@@ -37,7 +38,6 @@ tempfile = "3.1.0"
 bincode = "1.2.0"
 tobj = "3.0"
 image = { version = "0.23.0", default-features = false, features = ["png"] }
-rand = { version = "0.8", default-features = false, features = ["small_rng"] }

 [[bench]]
 name = "suite"
@@ -51,6 +51,11 @@ harness = false
 name = "sorting"
 harness = false

+[[bin]]
+name = "scrappy_bench"
+path = "src/scrappy_bench.rs"
+
+
 [workspace]
 members = [
     "sprs-ldl",

Finally, I wrote the following simple benchmark program to src/scrappy_bench.rs:

extern crate rand;
extern crate sprs;
extern crate ndarray;

use rand::Rng;
use sprs::prod::{csr_mulacc_dense_colmaj, csr_mulacc_dense_colmaj_old};

fn generate_random_csr(shape: sprs::Shape) -> sprs::CsMat<f32> {
    let mut rng = rand::thread_rng();
    let (rows, cols) = shape;
    let nnz = rng.gen_range(1..rows * cols / 2);

    let mut mat = sprs::TriMat::<f32>::new(shape);
    for _ in 0..nnz {
        let r = rng.gen_range(0..rows);
        let c = rng.gen_range(0..cols);
        let v = rng.gen::<f32>();
        mat.add_triplet(r, c, v);
    }

    mat.to_csr()
}

fn generate_random_array(shape: sprs::Shape) -> ndarray::Array2<f32> {
    let mut rng = rand::thread_rng();
    let data = (0..shape.0 * shape.1).map(|_| rng.gen()).collect::<Vec<f32>>();
    ndarray::Array2::from_shape_vec(shape, data).unwrap()
}

fn main() {
    const N_RUNS_PER_LOOP: usize = 1000;
    const N_LOOP: usize = 10;
    const MAX_LEFT_DIM: usize = 200;
    const MAX_RIGHT_DIM: usize = 200;
    const MAX_INNER_DIM: usize = 5000;

    let mut rng = rand::thread_rng();
    for _ in 0..N_LOOP {
        let left_shape: sprs::Shape = (
            rng.gen_range(10..=MAX_LEFT_DIM),
            rng.gen_range(10..=MAX_INNER_DIM),
        );
        let right_shape: sprs::Shape = (
            left_shape.1,
            rng.gen_range(1..=MAX_RIGHT_DIM),
        );
        let rand_csr = generate_random_csr(left_shape);
        let rand_vec = generate_random_array(right_shape);
        let mut out = ndarray::Array2::<f32>::zeros((left_shape.0, right_shape.1));

        let start_t = std::time::Instant::now();
        for _ in 0..N_RUNS_PER_LOOP {
            csr_mulacc_dense_colmaj(rand_csr.view(), rand_vec.view(), out.view_mut());
        }
        println!("New\t{}ms", start_t.elapsed().as_millis());

        let start_t = std::time::Instant::now();
        for _ in 0..N_RUNS_PER_LOOP {
            csr_mulacc_dense_colmaj_old(rand_csr.view(), rand_vec.view(), out.view_mut());
        }
        println!("Old\t{}ms", start_t.elapsed().as_millis());
        println!("");
    }
}

The benchmark can then be run by cargo run --release scrappy_bench. Here's the result of one such run:

New     4846ms
Old     1874ms

New     450ms
Old     166ms

New     12071ms
Old     4460ms

New     1651ms
Old     581ms

New     60298ms
Old     38835ms

New     61863ms
Old     21512ms

New     7074ms
Old     2528ms

New     32ms
Old     10ms

New     5847ms
Old     2168ms

New     12322ms
Old     4290ms

As you can see, the new implementation with MulAcc is indeed significantly slower than the old one.

Do the numbers change significantly if you enable lto?

Yep, the difference seems to widen. With:

--- a/Cargo.toml
+++ b/Cargo.toml
@@ -29,6 +29,7 @@ smallvec = "1.4.0"
 rayon = { version = "1.3.0", optional = true }
 num_cpus = { version = "1.13.0", optional = true }
 approx = { version = "0.5", optional = true }
+rand = { version = "0.8", default-features = false, features = ["small_rng"] }

 [dev-dependencies]
 num-derive = "0.3"
@@ -37,7 +38,6 @@ tempfile = "3.1.0"
 bincode = "1.2.0"
 tobj = "3.0"
 image = { version = "0.23.0", default-features = false, features = ["png"] }
-rand = { version = "0.8", default-features = false, features = ["small_rng"] }

 [[bench]]
 name = "suite"
@@ -51,6 +51,14 @@ harness = false
 name = "sorting"
 harness = false

+[[bin]]
+name = "scrappy_bench"
+path = "src/scrappy_bench.rs"
+
+[profile.release]
+lto = true
+
+
 [workspace]
 members = [
     "sprs-ldl",

I get:

New     2410ms
Old     730ms

New     2870ms
Old     743ms

New     197ms
Old     62ms

New     43ms
Old     17ms

New     230ms
Old     59ms

New     4290ms
Old     1259ms

New     5172ms
Old     1578ms

New     13772ms
Old     3595ms

New     3718ms
Old     1950ms

New     7558ms
Old     2154ms

So I had a bit more fun debugging the cause of the speed regression. This is a tricky one because there turned out to be 2 causes for the slowdown:

  • Change in memory access pattern (likely to be less cache-friendly)
  • mul_add being slower than a separate multiplication operation followed by an add (see rust-lang/rust#49842)

For the former, I'm referring to the change here, where we used to copy the value of ocol[[orow]] to a local variable prev, accumulate all updates in the local variable, and copy back the new value only once when everything is done. The new implementation directly accumulates updates to ocol[[orow]] to avoid the potentially expensive .clone(). My guess is that constantly reading then immediately updating data that lives on the heap like ocol[[orow]] is a lot less cache-friendly than doing the same for a local variable that lives on the stack.

For testing purposes, we can change it back to the old behavior at the cost of adding an explicit .clone():

diff --git a/src/sparse/prod.rs b/src/sparse/prod.rs
index aa7241b1..8c721474 100644
--- a/src/sparse/prod.rs
+++ b/src/sparse/prod.rs
@@ -303,7 +343,7 @@ pub fn csr_mulacc_dense_colmaj<'a, N, A, B, I, Iptr>(
     rhs: ArrayView<B, Ix2>,
     mut out: ArrayViewMut<'a, N, Ix2>,
 ) where
-    N: 'a + crate::MulAcc<A, B>,
+    N: 'a + crate::MulAcc<A, B> + Clone,
     I: 'a + SpIndex,
     Iptr: 'a + SpIndex,
 {
@@ -322,11 +362,12 @@ pub fn csr_mulacc_dense_colmaj<'a, N, A, B, I, Iptr>(
     let axis1 = Axis(1);
     for (mut ocol, rcol) in out.axis_iter_mut(axis1).zip(rhs.axis_iter(axis1)) {
         for (orow, lrow) in lhs.outer_iterator().enumerate() {
-            let oval = &mut ocol[[orow]];
+            let mut oval = ocol[[orow]].clone();
             for (rrow, lval) in lrow.iter() {
                 let rval = &rcol[[rrow]];
                 oval.mul_acc(lval, rval);
             }
+            ocol[[orow]] = oval;
         }
     }
 }

For mul_add, we can test by replacing the default impl of MulAcc with the following:

--- a/src/mul_acc.rs
+++ b/src/mul_acc.rs
@@ -19,13 +19,13 @@ pub trait MulAcc<A = Self, B = A> {

 /// Default for types which supports `mul_add`
 impl<N, A, B> MulAcc<A, B> for N
-where
-    N: Copy,
-    B: Copy,
-    A: num_traits::MulAdd<B, N, Output = N> + Copy,
+    where
+        N: Copy,
+        B: std::ops::Add<N, Output = N> + Copy,
+        A: std::ops::Mul<B, Output = B> + Copy,
 {
     fn mul_acc(&mut self, a: &A, b: &B) {
-        *self = a.mul_add(*b, *self);
+        *self = (*a * *b) + *self;
     }
 }

With both patches, when I rerun cargo run --release scrappy_bench, the speed gap between old & new implementations is gone, if the new impl isn't a bit faster:

New     4217ms
Old     4305ms

New     72ms
Old     73ms

New     1126ms
Old     1135ms

New     347ms
Old     350ms

New     471ms
Old     473ms

New     2723ms
Old     2724ms

New     20ms
Old     20ms

New     4179ms
Old     4200ms

New     1677ms
Old     1685ms

New     1381ms
Old     1375ms

With only the second (mul_add) patch but without first, the speed gap will still be obvious:

New     1431ms
Old     417ms

New     927ms
Old     298ms

New     135ms
Old     38ms

New     1967ms
Old     577ms

New     1209ms
Old     335ms

New     2096ms
Old     556ms

New     3326ms
Old     1046ms

New     1360ms
Old     376ms

New     122ms
Old     34ms

New     986ms
Old     273ms

With only the first patch but without the second, the speed gap is also obvious:

New     923ms
Old     420ms

New     4243ms
Old     1977ms

New     335ms
Old     138ms

New     773ms
Old     308ms

New     3560ms
Old     1778ms

New     2ms
Old     1ms

New     1334ms
Old     623ms

New     243ms
Old     98ms

New     341ms
Old     161ms

New     239ms
Old     92ms

but if I turn on target-cpu=native to speed up mul_add operations by running env RUSTFLAGS='-C target-cpu=native' cargo run --release scrappy_bench, we see (on my machine at least) the gap shrinks significantly although it doesn't disappear completely:

New     127ms
Old     104ms

New     93ms
Old     81ms

New     45ms
Old     46ms

New     578ms
Old     504ms

New     2250ms
Old     1769ms

New     69ms
Old     58ms

New     101ms
Old     89ms

New     1343ms
Old     1107ms

New     1419ms
Old     1230ms

New     76ms
Old     73ms

Ok so where do we go from here?

  • My guess is that the mul_add issue could possibly be circumvented by doing conditional compilation so that we only use the mul_add operation when the target CPUs support it or when user explicitly enables it.
  • I'm not sure about how to balance between avoiding .clone() v.s. avoiding large numbers of repeated cache-unfriendly updates to heap data though: this tension seems inherent to the abstraction design of forgoing the assumption that elements of a matrix are cheaply copyable.
aujxn commented

@tomtung Is there any reason other than the potential efficiency improvement of using mul_add? It seems from your benchmarks that the improvement isn't present on your machine and I'm getting similar results as well. It doesn't seem like the additional complexity of conditional compilation would be worth it here unless there are other benefits I am missing.

For the second issue, I don't mind the Copy trait requirement but I could imagine a case where this might be restrictive. I will test if having the explicit clone() is worse than with the Copy requirement. Cloning the value to the stack might be worse if the matrix is sparse enough that few operations are happening on each item or when the type is large. I would expect these conditions are less common for the users, the matrices I'm using are mostly from finite elements.

It still surprises me that the repeated accesses to the heap are hurting cache performance since the accesses are to the same location each time. I guess the alternating access between the rhs and out array views is likely where the cache is getting wiped but valgrind and iai can probably answer this for sure.

aujxn commented

So I ran my own bench with criterion on a single csr by dense matrix product with 6 variations. The first 3 are heap access, clone(), and Copy; all with mul_acc. The second three are replacing mul_add with the code you tried in the MulAcc trait.

csr * dense (column major)/heap access                                                                            
                        time:   [21.416 ms 21.629 ms 22.209 ms]
                        change: [-1.3591% +3.3931% +8.8655%] (p = 0.23 > 0.05)
                        No change in performance detected.
csr * dense (column major)/cloned                                                                            
                        time:   [21.637 ms 22.171 ms 22.537 ms]
                        change: [-0.0919% +1.3130% +2.9281%] (p = 0.11 > 0.05)
                        No change in performance detected.
csr * dense (column major)/copied                                                                            
                        time:   [21.665 ms 21.981 ms 22.282 ms]
                        change: [-5.2071% -2.9865% -0.8159%] (p = 0.02 < 0.05)
                        Change within noise threshold.
csr * dense (column major)/heap access no mul_add                                                                            
                        time:   [12.589 ms 12.637 ms 12.665 ms]
                        change: [+2.5819% +5.0996% +7.7114%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 10 measurements (40.00%)
  1 (10.00%) low severe
  1 (10.00%) low mild
  2 (20.00%) high severe
csr * dense (column major)/cloned no mul_add                                                                            
                        time:   [11.931 ms 12.333 ms 12.707 ms]
                        change: [-1.9158% -0.3659% +1.8278%] (p = 0.77 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
csr * dense (column major)/copied no mul_add                                                                            
                        time:   [11.859 ms 12.014 ms 12.132 ms]
                        change: [-5.7219% -4.9127% -4.2154%] (p = 0.00 < 0.05)
                        Performance has improved.

On my environment I hardly any difference between heap, copy, and clone but I do get an improvement with removing the mul_add. I did not enable target-cpu=native or lto for this bench, though. Also, the sparse matrix I tested might be significantly more sparse than your random ones.

aujxn commented

I added benches that are exactly the same as yours and am getting the same result as with my FEM matrix. Hardly any difference between heap access, clones, and copies on my environment.

To see the changes I made look at this commit in my fork.

Hi @aujxn, thanks for looking into this! Sorry about the delay; I've been a bit busy at work lately.

I cloned your fork and ran the benchmark. The full result is attached below (with added blank lines for grouping). Some observations:

  • As you said, little difference between heap access v.s. cloned/copied can be observed for FEM matrix
    • Well strictly speaking, for no mul_add, heap access took ~20% more time but this difference is much milder than I observed previously
  • The same is largely true for other random matrices if mul_add is used
  • But if for no mul_add, the "heap access" implementation is consistently taking ~2x time, which is basically consistent with what I observed previously.

Curious if the result is different on your machine.

csr * dense (column major)/heap access/FEM matrix
                        time:   [55.487 ms 56.744 ms 58.344 ms]
csr * dense (column major)/cloned/FEM matrix
                        time:   [53.276 ms 59.039 ms 69.705 ms]
csr * dense (column major)/copied/FEM matrix
                        time:   [50.667 ms 53.405 ms 55.232 ms]

csr * dense (column major)/heap access no mul_add/FEM matrix
                        time:   [36.970 ms 37.613 ms 38.236 ms]
csr * dense (column major)/cloned no mul_add/FEM matrix
                        time:   [29.434 ms 29.827 ms 30.294 ms]
csr * dense (column major)/copied no mul_add/FEM matrix
                        time:   [29.478 ms 30.060 ms 30.655 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild


csr * dense (column major)/heap access/159x1893 times 1893x146
                        time:   [45.787 ms 46.982 ms 48.124 ms]
csr * dense (column major)/cloned/159x1893 times 1893x146
                        time:   [37.908 ms 39.056 ms 40.246 ms]
csr * dense (column major)/copied/159x1893 times 1893x146
                        time:   [37.958 ms 39.063 ms 40.114 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

csr * dense (column major)/heap access no mul_add/159x1893 times 1893x146
                        time:   [41.339 ms 41.724 ms 42.164 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
csr * dense (column major)/cloned no mul_add/159x1893 times 1893x146
                        time:   [19.485 ms 19.791 ms 20.418 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
csr * dense (column major)/copied no mul_add/159x1893 times 1893x146
                        time:   [19.816 ms 20.136 ms 20.435 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild


csr * dense (column major)/heap access/78x3324 times 3324x65
                        time:   [5.3216 ms 5.4767 ms 5.7249 ms]
csr * dense (column major)/cloned/78x3324 times 3324x65
                        time:   [4.7937 ms 4.9362 ms 5.0684 ms]
csr * dense (column major)/copied/78x3324 times 3324x65
                        time:   [4.8170 ms 4.9555 ms 5.1752 ms]

csr * dense (column major)/heap access no mul_add/78x3324 times 3324x65
                        time:   [4.8502 ms 4.8742 ms 4.8950 ms]
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high severe
csr * dense (column major)/cloned no mul_add/78x3324 times 3324x65
                        time:   [2.3820 ms 2.4123 ms 2.4575 ms]
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild
csr * dense (column major)/copied no mul_add/78x3324 times 3324x65
                        time:   [2.3544 ms 2.4325 ms 2.4874 ms]


csr * dense (column major)/heap access/142x2841 times 2841x190
                        time:   [81.660 ms 84.001 ms 86.275 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
csr * dense (column major)/cloned/142x2841 times 2841x190
                        time:   [70.567 ms 72.336 ms 73.977 ms]
csr * dense (column major)/copied/142x2841 times 2841x190
                        time:   [69.629 ms 72.524 ms 77.040 ms]

csr * dense (column major)/heap access no mul_add/142x2841 times 2841x190
                        time:   [73.203 ms 73.561 ms 74.019 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
csr * dense (column major)/cloned no mul_add/142x2841 times 2841x190
                        time:   [34.449 ms 35.334 ms 35.954 ms]
csr * dense (column major)/copied no mul_add/142x2841 times 2841x190
                        time:   [35.703 ms 36.066 ms 36.496 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild


csr * dense (column major)/heap access/79x939 times 939x163
                        time:   [3.6998 ms 3.8007 ms 3.9015 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
csr * dense (column major)/cloned/79x939 times 939x163
                        time:   [2.8562 ms 2.9331 ms 3.0137 ms]
csr * dense (column major)/copied/79x939 times 939x163
                        time:   [2.7363 ms 2.7931 ms 2.8802 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

csr * dense (column major)/heap access no mul_add/79x939 times 939x163
                        time:   [3.3228 ms 3.4134 ms 3.5378 ms]
csr * dense (column major)/cloned no mul_add/79x939 times 939x163
                        time:   [1.5729 ms 1.6007 ms 1.6306 ms]
csr * dense (column major)/copied no mul_add/79x939 times 939x163
                        time:   [1.5662 ms 1.5872 ms 1.6092 ms]
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high severe


csr * dense (column major)/heap access/91x2643 times 2643x10
                        time:   [756.51 us 769.47 us 791.66 us]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
csr * dense (column major)/cloned/91x2643 times 2643x10
                        time:   [578.17 us 596.17 us 611.39 us]
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe
csr * dense (column major)/copied/91x2643 times 2643x10
                        time:   [593.18 us 616.22 us 630.58 us]

csr * dense (column major)/heap access no mul_add/91x2643 times 2643x10
                        time:   [675.60 us 678.95 us 682.22 us]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
csr * dense (column major)/cloned no mul_add/91x2643 times 2643x10
                        time:   [324.66 us 329.59 us 336.12 us]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
csr * dense (column major)/copied no mul_add/91x2643 times 2643x10
                        time:   [321.06 us 328.59 us 335.53 us]


csr * dense (column major)/heap access/161x4040 times 4040x136
                        time:   [29.477 ms 30.191 ms 31.292 ms]
csr * dense (column major)/cloned/161x4040 times 4040x136
                        time:   [26.014 ms 26.671 ms 27.298 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
csr * dense (column major)/copied/161x4040 times 4040x136
                        time:   [26.020 ms 26.533 ms 27.227 ms]

csr * dense (column major)/heap access no mul_add/161x4040 times 4040x136
                        time:   [25.860 ms 26.068 ms 26.440 ms]
csr * dense (column major)/cloned no mul_add/161x4040 times 4040x136
                        time:   [13.061 ms 13.251 ms 13.436 ms]
csr * dense (column major)/copied no mul_add/161x4040 times 4040x136
                        time:   [12.926 ms 13.189 ms 13.464 ms]


csr * dense (column major)/heap access/147x1918 times 1918x173
                        time:   [32.647 ms 33.848 ms 34.933 ms]
csr * dense (column major)/cloned/147x1918 times 1918x173
                        time:   [25.215 ms 25.939 ms 26.671 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
csr * dense (column major)/copied/147x1918 times 1918x173
                        time:   [25.740 ms 26.781 ms 27.521 ms]

csr * dense (column major)/heap access no mul_add/147x1918 times 1918x173
                        time:   [29.217 ms 29.415 ms 29.578 ms]
csr * dense (column major)/cloned no mul_add/147x1918 times 1918x173
                        time:   [13.961 ms 14.110 ms 14.298 ms]
csr * dense (column major)/copied no mul_add/147x1918 times 1918x173
                        time:   [13.911 ms 14.071 ms 14.260 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild


csr * dense (column major)/heap access/72x1983 times 1983x143
                        time:   [13.452 ms 13.595 ms 13.738 ms]
csr * dense (column major)/cloned/72x1983 times 1983x143
                        time:   [11.679 ms 12.364 ms 13.329 ms]
csr * dense (column major)/copied/72x1983 times 1983x143
                        time:   [11.480 ms 11.744 ms 12.336 ms]

csr * dense (column major)/heap access no mul_add/72x1983 times 1983x143
                        time:   [12.315 ms 12.384 ms 12.459 ms]
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high mild
csr * dense (column major)/cloned no mul_add/72x1983 times 1983x143
                        time:   [5.9033 ms 6.0113 ms 6.1852 ms]
csr * dense (column major)/copied no mul_add/72x1983 times 1983x143
                        time:   [5.8579 ms 5.9128 ms 5.9835 ms]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild


csr * dense (column major)/heap access/138x669 times 669x34
                        time:   [955.18 us 973.95 us 992.85 us]
csr * dense (column major)/cloned/138x669 times 669x34                                                                            https://github.com/aujxn/sprs.git
                        time:   [639.57 us 702.86 us 739.42 us]
csr * dense (column major)/copied/138x669 times 669x34
                        time:   [692.00 us 705.59 us 726.33 us]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

csr * dense (column major)/heap access no mul_add/138x669 times 669x34
                        time:   [813.84 us 818.37 us 824.42 us]
csr * dense (column major)/cloned no mul_add/138x669 times 669x34
                        time:   [377.44 us 382.25 us 386.91 us]
csr * dense (column major)/copied no mul_add/138x669 times 669x34
                        time:   [374.05 us 380.29 us 387.93 us]


csr * dense (column major)/heap access/181x1896 times 1896x17
                        time:   [1.7983 ms 1.8631 ms 1.9685 ms]
csr * dense (column major)/cloned/181x1896 times 1896x17
                        time:   [1.4569 ms 1.5140 ms 1.5575 ms]
csr * dense (column major)/copied/181x1896 times 1896x17
                        time:   [1.4095 ms 1.4471 ms 1.5049 ms]

csr * dense (column major)/heap access no mul_add/181x1896 times 1896x17
                        time:   [1.6222 ms 1.6331 ms 1.6474 ms]
csr * dense (column major)/cloned no mul_add/181x1896 times 1896x17
                        time:   [781.03 us 794.81 us 806.94 us]
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high mild
csr * dense (column major)/copied no mul_add/181x1896 times 1896x17
                        time:   [787.81 us 798.05 us 808.71 us]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Is there any reason other than the potential efficiency improvement of using mul_add?

I guess since mul_add is performed in one step with a single rounding, while doing mul then add rounds twice, one benefit of the former would be the better precision it affords. Although for my use case (machine learning) I don't really care, maybe it's beneficial for other applications?