`MulAcc` causes significant speed regression
tomtung opened this issue · 9 comments
I just found out that my implementation of an ML model has been training up to several times slower than before. After some debugging, it turns out the cause was #276, where the `MulAcc` trait was introduced in d56bd72, and specifically where the `Copy` bound was removed in d2f0da7. For details please see tomtung/omikuji#32.
My intuitive guess is that the `MulAcc` trait got rid of the `Copy` bound at the cost of forcing values to be passed as references (pointers), which might have made it hard for the compiler to inline and/or apply other optimizations. Since the `+=` operation typically sits in the innermost part of the loop, any slowdown in such a hot spot results in a significant speed regression.
I don't have context on the introduction of `MulAcc`, so I'm not sure what the right way to fix this is. Let me know if there's something I could help with.
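To make the contrast concrete, here is a minimal standalone sketch (simplified, not the actual sprs implementation; the trait signature matches the one visible in the `mul_acc.rs` diff further down) of the two accumulation styles:

```rust
/// Same shape as the sprs trait: accumulator and operands passed by reference.
pub trait MulAcc<A = Self, B = A> {
    fn mul_acc(&mut self, a: &A, b: &B);
}

impl MulAcc for f32 {
    fn mul_acc(&mut self, a: &f32, b: &f32) {
        *self += a * b;
    }
}

/// Old style: `Copy` values, so the hot loop works on a stack local and plain
/// values, which the compiler can easily keep in registers.
fn dot_copy(acc: &mut f32, lhs: &[f32], rhs: &[f32]) {
    let mut local = *acc;
    for (&a, &b) in lhs.iter().zip(rhs) {
        local = local + a * b;
    }
    *acc = local;
}

/// New style: every step of the hot loop goes through references; how well
/// this optimizes depends on the compiler inlining `mul_acc`.
fn dot_mulacc<N: MulAcc<A, B>, A, B>(acc: &mut N, lhs: &[A], rhs: &[B]) {
    for (a, b) in lhs.iter().zip(rhs) {
        acc.mul_acc(a, b);
    }
}
```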
I tried confirming the speed regression with a quick & scrappy benchmark. I first cloned sprs locally and patched it to add back the previous `csr_mulacc_dense_colmaj` implementation under the suffix `_old`:
--- a/src/sparse/prod.rs
+++ b/src/sparse/prod.rs
@@ -295,6 +295,46 @@ pub fn csc_mulacc_dense_colmaj<'a, N, A, B, I, Iptr>(
}
}
+/// CSR-dense colmaj multiplication
+///
+/// Performs better if rhs has few columns.
+pub fn csr_mulacc_dense_colmaj_old<'a, N1, N2, NOut, I, Iptr>(
+ lhs: CsMatViewI<N1, I, Iptr>,
+ rhs: ArrayView<N2, Ix2>,
+ mut out: ArrayViewMut<'a, NOut, Ix2>,
+) where
+ N1: 'a + Num + Copy,
+ N2: 'a + Num + Copy,
+ NOut: 'a + Num + Copy,
+ N1: std::ops::Mul<N2, Output = NOut>,
+ I: 'a + SpIndex,
+ Iptr: 'a + SpIndex,
+{
+ if lhs.cols() != rhs.shape()[0] {
+ panic!("Dimension mismatch");
+ }
+ if lhs.rows() != out.shape()[0] {
+ panic!("Dimension mismatch");
+ }
+ if rhs.shape()[1] != out.shape()[1] {
+ panic!("Dimension mismatch");
+ }
+ if !lhs.is_csr() {
+ panic!("Storage mismatch");
+ }
+ let axis1 = Axis(1);
+ for (mut ocol, rcol) in out.axis_iter_mut(axis1).zip(rhs.axis_iter(axis1)) {
+ for (orow, lrow) in lhs.outer_iterator().enumerate() {
+ let mut prev = ocol[[orow]];
+ for (rrow, &lval) in lrow.iter() {
+ let rval = rcol[[rrow]];
+ prev = prev + lval * rval;
+ }
+ ocol[[orow]] = prev;
+ }
+ }
+}
+
/// CSR-dense colmaj multiplication
///
/// Performs better if rhs has few columns.
Then I modified `Cargo.toml` a bit:
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -29,6 +29,7 @@ smallvec = "1.4.0"
rayon = { version = "1.3.0", optional = true }
num_cpus = { version = "1.13.0", optional = true }
approx = { version = "0.5", optional = true }
+rand = { version = "0.8", default-features = false, features = ["small_rng"] }
[dev-dependencies]
num-derive = "0.3"
@@ -37,7 +38,6 @@ tempfile = "3.1.0"
bincode = "1.2.0"
tobj = "3.0"
image = { version = "0.23.0", default-features = false, features = ["png"] }
-rand = { version = "0.8", default-features = false, features = ["small_rng"] }
[[bench]]
name = "suite"
@@ -51,6 +51,11 @@ harness = false
name = "sorting"
harness = false
+[[bin]]
+name = "scrappy_bench"
+path = "src/scrappy_bench.rs"
+
+
[workspace]
members = [
"sprs-ldl",
Finally, I wrote the following simple benchmark program at `src/scrappy_bench.rs`:
extern crate rand;
extern crate sprs;
extern crate ndarray;

use rand::Rng;
use sprs::prod::{csr_mulacc_dense_colmaj, csr_mulacc_dense_colmaj_old};

fn generate_random_csr(shape: sprs::Shape) -> sprs::CsMat<f32> {
    let mut rng = rand::thread_rng();
    let (rows, cols) = shape;
    let nnz = rng.gen_range(1..rows * cols / 2);
    let mut mat = sprs::TriMat::<f32>::new(shape);
    for _ in 0..nnz {
        let r = rng.gen_range(0..rows);
        let c = rng.gen_range(0..cols);
        let v = rng.gen::<f32>();
        mat.add_triplet(r, c, v);
    }
    mat.to_csr()
}

fn generate_random_array(shape: sprs::Shape) -> ndarray::Array2<f32> {
    let mut rng = rand::thread_rng();
    let data = (0..shape.0 * shape.1).map(|_| rng.gen()).collect::<Vec<f32>>();
    ndarray::Array2::from_shape_vec(shape, data).unwrap()
}

fn main() {
    const N_RUNS_PER_LOOP: usize = 1000;
    const N_LOOP: usize = 10;
    const MAX_LEFT_DIM: usize = 200;
    const MAX_RIGHT_DIM: usize = 200;
    const MAX_INNER_DIM: usize = 5000;

    let mut rng = rand::thread_rng();
    for _ in 0..N_LOOP {
        let left_shape: sprs::Shape = (
            rng.gen_range(10..=MAX_LEFT_DIM),
            rng.gen_range(10..=MAX_INNER_DIM),
        );
        let right_shape: sprs::Shape = (left_shape.1, rng.gen_range(1..=MAX_RIGHT_DIM));

        let rand_csr = generate_random_csr(left_shape);
        let rand_vec = generate_random_array(right_shape);
        let mut out = ndarray::Array2::<f32>::zeros((left_shape.0, right_shape.1));

        let start_t = std::time::Instant::now();
        for _ in 0..N_RUNS_PER_LOOP {
            csr_mulacc_dense_colmaj(rand_csr.view(), rand_vec.view(), out.view_mut());
        }
        println!("New\t{}ms", start_t.elapsed().as_millis());

        let start_t = std::time::Instant::now();
        for _ in 0..N_RUNS_PER_LOOP {
            csr_mulacc_dense_colmaj_old(rand_csr.view(), rand_vec.view(), out.view_mut());
        }
        println!("Old\t{}ms", start_t.elapsed().as_millis());
        println!("");
    }
}
The benchmark can then be run with `cargo run --release scrappy_bench`. Here's the result of one such run:
New 4846ms
Old 1874ms
New 450ms
Old 166ms
New 12071ms
Old 4460ms
New 1651ms
Old 581ms
New 60298ms
Old 38835ms
New 61863ms
Old 21512ms
New 7074ms
Old 2528ms
New 32ms
Old 10ms
New 5847ms
Old 2168ms
New 12322ms
Old 4290ms
As you can see, the new implementation with `MulAcc` is indeed significantly slower than the old one.
Do the numbers change significantly if you enable `lto`?
Yep, the difference seems to widen. With:
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -29,6 +29,7 @@ smallvec = "1.4.0"
rayon = { version = "1.3.0", optional = true }
num_cpus = { version = "1.13.0", optional = true }
approx = { version = "0.5", optional = true }
+rand = { version = "0.8", default-features = false, features = ["small_rng"] }
[dev-dependencies]
num-derive = "0.3"
@@ -37,7 +38,6 @@ tempfile = "3.1.0"
bincode = "1.2.0"
tobj = "3.0"
image = { version = "0.23.0", default-features = false, features = ["png"] }
-rand = { version = "0.8", default-features = false, features = ["small_rng"] }
[[bench]]
name = "suite"
@@ -51,6 +51,14 @@ harness = false
name = "sorting"
harness = false
+[[bin]]
+name = "scrappy_bench"
+path = "src/scrappy_bench.rs"
+
+[profile.release]
+lto = true
+
+
[workspace]
members = [
"sprs-ldl",
I get:
New 2410ms
Old 730ms
New 2870ms
Old 743ms
New 197ms
Old 62ms
New 43ms
Old 17ms
New 230ms
Old 59ms
New 4290ms
Old 1259ms
New 5172ms
Old 1578ms
New 13772ms
Old 3595ms
New 3718ms
Old 1950ms
New 7558ms
Old 2154ms
So I had a bit more fun debugging the cause of the speed regression. This is a tricky one because there turned out to be two causes for the slowdown:
- A change in memory access pattern (likely less cache-friendly)
- `mul_add` being slower than a separate multiplication followed by an add (see rust-lang/rust#49842)
For the former, I'm referring to the change here, where we used to copy the value of `ocol[[orow]]` to a local variable `prev`, accumulate all updates in that local variable, and copy the new value back only once at the end. The new implementation accumulates updates directly into `ocol[[orow]]` to avoid a potentially expensive `.clone()`. My guess is that constantly reading and then immediately updating heap data like `ocol[[orow]]` is a lot less cache-friendly than doing the same with a local variable that lives on the stack.
For testing purposes, we can change it back to the old behavior at the cost of adding an explicit `.clone()`:
diff --git a/src/sparse/prod.rs b/src/sparse/prod.rs
index aa7241b1..8c721474 100644
--- a/src/sparse/prod.rs
+++ b/src/sparse/prod.rs
@@ -303,7 +343,7 @@ pub fn csr_mulacc_dense_colmaj<'a, N, A, B, I, Iptr>(
rhs: ArrayView<B, Ix2>,
mut out: ArrayViewMut<'a, N, Ix2>,
) where
- N: 'a + crate::MulAcc<A, B>,
+ N: 'a + crate::MulAcc<A, B> + Clone,
I: 'a + SpIndex,
Iptr: 'a + SpIndex,
{
@@ -322,11 +362,12 @@ pub fn csr_mulacc_dense_colmaj<'a, N, A, B, I, Iptr>(
let axis1 = Axis(1);
for (mut ocol, rcol) in out.axis_iter_mut(axis1).zip(rhs.axis_iter(axis1)) {
for (orow, lrow) in lhs.outer_iterator().enumerate() {
- let oval = &mut ocol[[orow]];
+ let mut oval = ocol[[orow]].clone();
for (rrow, lval) in lrow.iter() {
let rval = &rcol[[rrow]];
oval.mul_acc(lval, rval);
}
+ ocol[[orow]] = oval;
}
}
}
For `mul_add`, we can test by replacing the default impl of `MulAcc` with the following:
--- a/src/mul_acc.rs
+++ b/src/mul_acc.rs
@@ -19,13 +19,13 @@ pub trait MulAcc<A = Self, B = A> {
/// Default for types which supports `mul_add`
impl<N, A, B> MulAcc<A, B> for N
-where
- N: Copy,
- B: Copy,
- A: num_traits::MulAdd<B, N, Output = N> + Copy,
+ where
+ N: Copy,
+ B: std::ops::Add<N, Output = N> + Copy,
+ A: std::ops::Mul<B, Output = B> + Copy,
{
fn mul_acc(&mut self, a: &A, b: &B) {
- *self = a.mul_add(*b, *self);
+ *self = (*a * *b) + *self;
}
}
With both patches, when I rerun `cargo run --release scrappy_bench`, the speed gap between the old & new implementations is gone, and the new impl is sometimes even a bit faster:
New 4217ms
Old 4305ms
New 72ms
Old 73ms
New 1126ms
Old 1135ms
New 347ms
Old 350ms
New 471ms
Old 473ms
New 2723ms
Old 2724ms
New 20ms
Old 20ms
New 4179ms
Old 4200ms
New 1677ms
Old 1685ms
New 1381ms
Old 1375ms
With only the second (`mul_add`) patch but not the first, the speed gap is still obvious:
New 1431ms
Old 417ms
New 927ms
Old 298ms
New 135ms
Old 38ms
New 1967ms
Old 577ms
New 1209ms
Old 335ms
New 2096ms
Old 556ms
New 3326ms
Old 1046ms
New 1360ms
Old 376ms
New 122ms
Old 34ms
New 986ms
Old 273ms
With only the first patch but not the second, the speed gap is also obvious:
New 923ms
Old 420ms
New 4243ms
Old 1977ms
New 335ms
Old 138ms
New 773ms
Old 308ms
New 3560ms
Old 1778ms
New 2ms
Old 1ms
New 1334ms
Old 623ms
New 243ms
Old 98ms
New 341ms
Old 161ms
New 239ms
Old 92ms
But if I turn on `target-cpu=native` to speed up the `mul_add` operations by running `env RUSTFLAGS='-C target-cpu=native' cargo run --release scrappy_bench`, the gap shrinks significantly (on my machine at least), although it doesn't disappear completely:
New 127ms
Old 104ms
New 93ms
Old 81ms
New 45ms
Old 46ms
New 578ms
Old 504ms
New 2250ms
Old 1769ms
New 69ms
Old 58ms
New 101ms
Old 89ms
New 1343ms
Old 1107ms
New 1419ms
Old 1230ms
New 76ms
Old 73ms
OK, so where do we go from here?
- My guess is that the `mul_add` issue could be circumvented with conditional compilation, so that we only use the `mul_add` operation when the target CPU supports it or when the user explicitly enables it (a minimal sketch follows this list).
- I'm not sure how to balance avoiding `.clone()` vs. avoiding a large number of repeated, cache-unfriendly updates to heap data, though: this tension seems inherent to the abstraction design of forgoing the assumption that matrix elements are cheaply copyable.
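To illustrate the first bullet, here is a minimal sketch of the mechanism only (not a proposed sprs API; the function name is made up for the example), gating the fused operation on FMA support at compile time:

```rust
// Pick the accumulation strategy at compile time. `target_feature = "fma"` is
// only set when building with e.g. `-C target-cpu=native` or
// `-C target-feature=+fma`; a user-facing cargo feature could opt in as well.
#[cfg(target_feature = "fma")]
fn mul_acc_f64(acc: &mut f64, a: f64, b: f64) {
    // Fused multiply-add: a single rounding, lowered to a real FMA instruction.
    *acc = a.mul_add(b, *acc);
}

#[cfg(not(target_feature = "fma"))]
fn mul_acc_f64(acc: &mut f64, a: f64, b: f64) {
    // Separate multiply and add: two roundings, but no slow library fallback.
    *acc += a * b;
}

fn main() {
    let mut acc = 1.0_f64;
    mul_acc_f64(&mut acc, 2.0, 3.0);
    assert_eq!(acc, 7.0);
}
```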
@tomtung Is there any reason other than the potential efficiency improvement for using `mul_add`? It seems from your benchmarks that the improvement isn't present on your machine, and I'm getting similar results as well. It doesn't seem like the additional complexity of conditional compilation would be worth it here unless there are other benefits I'm missing.
For the second issue, I don't mind the `Copy` trait requirement, but I could imagine cases where it might be restrictive. I will test whether the explicit `clone()` is worse than the `Copy` requirement. Cloning the value to the stack might be worse if the matrix is sparse enough that few operations happen on each item, or when the type is large. I would expect these conditions to be less common for users; the matrices I'm using are mostly from finite elements.
It still surprises me that the repeated accesses to the heap hurt cache performance, since the accesses are to the same location each time. I guess the alternating access between the `rhs` and `out` array views is likely where the cache is getting wiped, but `valgrind` and `iai` can probably answer this for sure.
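For reference, a bare-bones sketch of what an `iai` bench might look like (hypothetical file and function names; `iai` runs the function under cachegrind and reports instruction and cache counts, so the matrix setup below is counted too):

```rust
// benches/iai_mulacc.rs -- would need `[[bench]] name = "iai_mulacc"` with
// `harness = false` in Cargo.toml, plus `iai` as a dev-dependency.
use iai::black_box;

fn csr_times_dense() -> ndarray::Array2<f32> {
    // Small fixed inputs; setup is included in the counts, a known iai caveat.
    let mut tri = sprs::TriMat::<f32>::new((64, 256));
    for r in 0..64 {
        for c in (r..256).step_by(5) {
            tri.add_triplet(r, c, 1.0);
        }
    }
    let lhs = tri.to_csr();
    let rhs = ndarray::Array2::<f32>::ones((256, 32));
    let mut out = ndarray::Array2::<f32>::zeros((64, 32));
    sprs::prod::csr_mulacc_dense_colmaj(black_box(lhs.view()), rhs.view(), out.view_mut());
    out
}

iai::main!(csr_times_dense);
```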
So I ran my own bench with `criterion` on a single CSR-by-dense matrix product with 6 variations. The first three are heap access, `clone()`, and `Copy`, all with `mul_acc`. The other three replace `mul_add` with the code you tried in the `MulAcc` trait.
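For context, a rough sketch of how such a criterion group could be set up (hypothetical file, names, and test data, not the actual bench code in my fork; only the first variant is spelled out):

```rust
// benches/mulacc_variants.rs -- registered with `harness = false` in Cargo.toml.
use criterion::{criterion_group, criterion_main, Criterion};
use ndarray::Array2;
use sprs::{CsMat, TriMat};

fn test_matrices() -> (CsMat<f32>, Array2<f32>) {
    // Deterministic, moderately sparse inputs so the sketch needs no RNG.
    let (rows, inner, cols) = (150, 2000, 100);
    let mut tri = TriMat::<f32>::new((rows, inner));
    for r in 0..rows {
        for c in (r % 7..inner).step_by(13) {
            tri.add_triplet(r, c, (r + c) as f32 * 0.01);
        }
    }
    // Column-major dense rhs, as the function name expects.
    let dense = Array2::from_shape_fn((cols, inner), |(j, i)| (i * j % 17) as f32)
        .reversed_axes();
    (tri.to_csr(), dense)
}

fn bench_variants(c: &mut Criterion) {
    let (lhs, rhs) = test_matrices();
    let mut group = c.benchmark_group("csr * dense (column major)");
    group.sample_size(10);
    group.bench_function("heap access", |b| {
        b.iter(|| {
            let mut out = Array2::<f32>::zeros((lhs.rows(), rhs.shape()[1]));
            sprs::prod::csr_mulacc_dense_colmaj(lhs.view(), rhs.view(), out.view_mut());
            out
        })
    });
    // "cloned", "copied", and the three "no mul_add" variants would be
    // registered the same way, each calling its own patched implementation.
    group.finish();
}

criterion_group!(benches, bench_variants);
criterion_main!(benches);
```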
csr * dense (column major)/heap access
time: [21.416 ms 21.629 ms 22.209 ms]
change: [-1.3591% +3.3931% +8.8655%] (p = 0.23 > 0.05)
No change in performance detected.
csr * dense (column major)/cloned
time: [21.637 ms 22.171 ms 22.537 ms]
change: [-0.0919% +1.3130% +2.9281%] (p = 0.11 > 0.05)
No change in performance detected.
csr * dense (column major)/copied
time: [21.665 ms 21.981 ms 22.282 ms]
change: [-5.2071% -2.9865% -0.8159%] (p = 0.02 < 0.05)
Change within noise threshold.
csr * dense (column major)/heap access no mul_add
time: [12.589 ms 12.637 ms 12.665 ms]
change: [+2.5819% +5.0996% +7.7114%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 10 measurements (40.00%)
1 (10.00%) low severe
1 (10.00%) low mild
2 (20.00%) high severe
csr * dense (column major)/cloned no mul_add
time: [11.931 ms 12.333 ms 12.707 ms]
change: [-1.9158% -0.3659% +1.8278%] (p = 0.77 > 0.05)
No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high severe
csr * dense (column major)/copied no mul_add
time: [11.859 ms 12.014 ms 12.132 ms]
change: [-5.7219% -4.9127% -4.2154%] (p = 0.00 < 0.05)
Performance has improved.
In my environment I see hardly any difference between heap access, copy, and clone, but I do get an improvement from removing the `mul_add`. I did not enable `target-cpu=native` or `lto` for this bench, though. Also, the sparse matrix I tested might be significantly sparser than your random ones.
I added benches that are exactly the same as yours and am getting the same result as with my FEM matrix: hardly any difference between heap access, clones, and copies in my environment.
To see the changes I made, look at this commit in my fork (https://github.com/aujxn/sprs.git).
Hi @aujxn, thanks for looking into this! Sorry about the delay; I've been a bit busy at work lately.
I cloned your fork and ran the benchmark. The full result is attached below (with blank lines added for grouping). Some observations:
- As you said, little difference between heap access vs. cloned/copied can be observed for the FEM matrix
  - Strictly speaking, for no mul_add, heap access took ~20% more time, but this difference is much milder than what I observed previously
- The same is largely true for the other random matrices if mul_add is used
- But for no mul_add, the "heap access" implementation consistently takes ~2x the time, which is basically consistent with what I observed previously
Curious if the result is different on your machine.
csr * dense (column major)/heap access/FEM matrix
time: [55.487 ms 56.744 ms 58.344 ms]
csr * dense (column major)/cloned/FEM matrix
time: [53.276 ms 59.039 ms 69.705 ms]
csr * dense (column major)/copied/FEM matrix
time: [50.667 ms 53.405 ms 55.232 ms]
csr * dense (column major)/heap access no mul_add/FEM matrix
time: [36.970 ms 37.613 ms 38.236 ms]
csr * dense (column major)/cloned no mul_add/FEM matrix
time: [29.434 ms 29.827 ms 30.294 ms]
csr * dense (column major)/copied no mul_add/FEM matrix
time: [29.478 ms 30.060 ms 30.655 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
csr * dense (column major)/heap access/159x1893 times 1893x146
time: [45.787 ms 46.982 ms 48.124 ms]
csr * dense (column major)/cloned/159x1893 times 1893x146
time: [37.908 ms 39.056 ms 40.246 ms]
csr * dense (column major)/copied/159x1893 times 1893x146
time: [37.958 ms 39.063 ms 40.114 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
csr * dense (column major)/heap access no mul_add/159x1893 times 1893x146
time: [41.339 ms 41.724 ms 42.164 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
csr * dense (column major)/cloned no mul_add/159x1893 times 1893x146
time: [19.485 ms 19.791 ms 20.418 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high severe
csr * dense (column major)/copied no mul_add/159x1893 times 1893x146
time: [19.816 ms 20.136 ms 20.435 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
csr * dense (column major)/heap access/78x3324 times 3324x65
time: [5.3216 ms 5.4767 ms 5.7249 ms]
csr * dense (column major)/cloned/78x3324 times 3324x65
time: [4.7937 ms 4.9362 ms 5.0684 ms]
csr * dense (column major)/copied/78x3324 times 3324x65
time: [4.8170 ms 4.9555 ms 5.1752 ms]
csr * dense (column major)/heap access no mul_add/78x3324 times 3324x65
time: [4.8502 ms 4.8742 ms 4.8950 ms]
Found 2 outliers among 10 measurements (20.00%)
1 (10.00%) low mild
1 (10.00%) high severe
csr * dense (column major)/cloned no mul_add/78x3324 times 3324x65
time: [2.3820 ms 2.4123 ms 2.4575 ms]
Found 2 outliers among 10 measurements (20.00%)
2 (20.00%) high mild
csr * dense (column major)/copied no mul_add/78x3324 times 3324x65
time: [2.3544 ms 2.4325 ms 2.4874 ms]
csr * dense (column major)/heap access/142x2841 times 2841x190
time: [81.660 ms 84.001 ms 86.275 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
csr * dense (column major)/cloned/142x2841 times 2841x190
time: [70.567 ms 72.336 ms 73.977 ms]
csr * dense (column major)/copied/142x2841 times 2841x190
time: [69.629 ms 72.524 ms 77.040 ms]
csr * dense (column major)/heap access no mul_add/142x2841 times 2841x190
time: [73.203 ms 73.561 ms 74.019 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
csr * dense (column major)/cloned no mul_add/142x2841 times 2841x190
time: [34.449 ms 35.334 ms 35.954 ms]
csr * dense (column major)/copied no mul_add/142x2841 times 2841x190
time: [35.703 ms 36.066 ms 36.496 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
csr * dense (column major)/heap access/79x939 times 939x163
time: [3.6998 ms 3.8007 ms 3.9015 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
csr * dense (column major)/cloned/79x939 times 939x163
time: [2.8562 ms 2.9331 ms 3.0137 ms]
csr * dense (column major)/copied/79x939 times 939x163
time: [2.7363 ms 2.7931 ms 2.8802 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
csr * dense (column major)/heap access no mul_add/79x939 times 939x163
time: [3.3228 ms 3.4134 ms 3.5378 ms]
csr * dense (column major)/cloned no mul_add/79x939 times 939x163
time: [1.5729 ms 1.6007 ms 1.6306 ms]
csr * dense (column major)/copied no mul_add/79x939 times 939x163
time: [1.5662 ms 1.5872 ms 1.6092 ms]
Found 2 outliers among 10 measurements (20.00%)
1 (10.00%) low mild
1 (10.00%) high severe
csr * dense (column major)/heap access/91x2643 times 2643x10
time: [756.51 us 769.47 us 791.66 us]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high severe
csr * dense (column major)/cloned/91x2643 times 2643x10
time: [578.17 us 596.17 us 611.39 us]
Found 2 outliers among 10 measurements (20.00%)
1 (10.00%) high mild
1 (10.00%) high severe
csr * dense (column major)/copied/91x2643 times 2643x10
time: [593.18 us 616.22 us 630.58 us]
csr * dense (column major)/heap access no mul_add/91x2643 times 2643x10
time: [675.60 us 678.95 us 682.22 us]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
csr * dense (column major)/cloned no mul_add/91x2643 times 2643x10
time: [324.66 us 329.59 us 336.12 us]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high severe
csr * dense (column major)/copied no mul_add/91x2643 times 2643x10
time: [321.06 us 328.59 us 335.53 us]
csr * dense (column major)/heap access/161x4040 times 4040x136
time: [29.477 ms 30.191 ms 31.292 ms]
csr * dense (column major)/cloned/161x4040 times 4040x136
time: [26.014 ms 26.671 ms 27.298 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
csr * dense (column major)/copied/161x4040 times 4040x136
time: [26.020 ms 26.533 ms 27.227 ms]
csr * dense (column major)/heap access no mul_add/161x4040 times 4040x136
time: [25.860 ms 26.068 ms 26.440 ms]
csr * dense (column major)/cloned no mul_add/161x4040 times 4040x136
time: [13.061 ms 13.251 ms 13.436 ms]
csr * dense (column major)/copied no mul_add/161x4040 times 4040x136
time: [12.926 ms 13.189 ms 13.464 ms]
csr * dense (column major)/heap access/147x1918 times 1918x173
time: [32.647 ms 33.848 ms 34.933 ms]
csr * dense (column major)/cloned/147x1918 times 1918x173
time: [25.215 ms 25.939 ms 26.671 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
csr * dense (column major)/copied/147x1918 times 1918x173
time: [25.740 ms 26.781 ms 27.521 ms]
csr * dense (column major)/heap access no mul_add/147x1918 times 1918x173
time: [29.217 ms 29.415 ms 29.578 ms]
csr * dense (column major)/cloned no mul_add/147x1918 times 1918x173
time: [13.961 ms 14.110 ms 14.298 ms]
csr * dense (column major)/copied no mul_add/147x1918 times 1918x173
time: [13.911 ms 14.071 ms 14.260 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
csr * dense (column major)/heap access/72x1983 times 1983x143
time: [13.452 ms 13.595 ms 13.738 ms]
csr * dense (column major)/cloned/72x1983 times 1983x143
time: [11.679 ms 12.364 ms 13.329 ms]
csr * dense (column major)/copied/72x1983 times 1983x143
time: [11.480 ms 11.744 ms 12.336 ms]
csr * dense (column major)/heap access no mul_add/72x1983 times 1983x143
time: [12.315 ms 12.384 ms 12.459 ms]
Found 2 outliers among 10 measurements (20.00%)
1 (10.00%) low mild
1 (10.00%) high mild
csr * dense (column major)/cloned no mul_add/72x1983 times 1983x143
time: [5.9033 ms 6.0113 ms 6.1852 ms]
csr * dense (column major)/copied no mul_add/72x1983 times 1983x143
time: [5.8579 ms 5.9128 ms 5.9835 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
csr * dense (column major)/heap access/138x669 times 669x34
time: [955.18 us 973.95 us 992.85 us]
csr * dense (column major)/cloned/138x669 times 669x34
time: [639.57 us 702.86 us 739.42 us]
csr * dense (column major)/copied/138x669 times 669x34
time: [692.00 us 705.59 us 726.33 us]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
csr * dense (column major)/heap access no mul_add/138x669 times 669x34
time: [813.84 us 818.37 us 824.42 us]
csr * dense (column major)/cloned no mul_add/138x669 times 669x34
time: [377.44 us 382.25 us 386.91 us]
csr * dense (column major)/copied no mul_add/138x669 times 669x34
time: [374.05 us 380.29 us 387.93 us]
csr * dense (column major)/heap access/181x1896 times 1896x17
time: [1.7983 ms 1.8631 ms 1.9685 ms]
csr * dense (column major)/cloned/181x1896 times 1896x17
time: [1.4569 ms 1.5140 ms 1.5575 ms]
csr * dense (column major)/copied/181x1896 times 1896x17
time: [1.4095 ms 1.4471 ms 1.5049 ms]
csr * dense (column major)/heap access no mul_add/181x1896 times 1896x17
time: [1.6222 ms 1.6331 ms 1.6474 ms]
csr * dense (column major)/cloned no mul_add/181x1896 times 1896x17
time: [781.03 us 794.81 us 806.94 us]
Found 2 outliers among 10 measurements (20.00%)
1 (10.00%) low mild
1 (10.00%) high mild
csr * dense (column major)/copied no mul_add/181x1896 times 1896x17
time: [787.81 us 798.05 us 808.71 us]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
> Is there any reason other than the potential efficiency improvement of using mul_add?

I guess that since `mul_add` is performed in one step with a single rounding, while a `mul` followed by an `add` rounds twice, one benefit of the former would be the better precision it affords. Although for my use case (machine learning) I don't really care, maybe it's beneficial for other applications?
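For what it's worth, a tiny self-contained example of that rounding difference (values hand-picked so the effect is visible):

```rust
fn main() {
    let a = 1.0_f64 + f64::EPSILON; // 1 + 2^-52
    let b = 1.0_f64 - f64::EPSILON; // 1 - 2^-52
    let c = -1.0_f64;
    // The exact product a*b is 1 - 2^-104; rounding it to f64 gives exactly 1.0,
    // so multiply-then-add loses the small term entirely.
    let separate = a * b + c; // 0.0
    // mul_add rounds only once, after the fused multiply-add.
    let fused = a.mul_add(b, c); // -2^-104, about -4.9e-32
    println!("separate = {separate:e}, fused = {fused:e}");
}
```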