`BitXorAssign<&BitSlice>` is unexpectedly slow.
JStarx opened this issue · 2 comments
The following function considers a Vec<Vec<usize>>
as a binary matrix with bits packed into the usize
and reduces the first row of that matrix (meant to be representative of the type of operations you'd do on matrices in general).
pub fn reduce1(mat: &mut Vec<Vec<usize>>) {
let (first, rest) = mat.split_first_mut().unwrap();
for row in rest {
if row[0] & 0x1 == 1 {
for (x, y) in row.iter_mut().zip(first.iter()) {
*x ^= *y;
}
}
}
}
If we instead want to represent the rows of our matrix as BitVec
's this would look like the following.
pub fn reduce2(mat: &mut Vec<BitVec<usize, Lsb0>>) {
let (first, rest) = mat.split_first_mut().unwrap();
for row in rest {
if row[0] {
*row ^= &*first;
}
}
}
Maybe I am misunderstanding what happens under the hood but I'd expect the second function to be doing exactly what the first is with not a whole lot of overhead. But in fact reduce2
is like 40x slower. On a 1500 x 5500
ish size matrix the benchmarks give 29.481μs
and 1.1134 ms
respectively.
There are a lot of benefits elsewhere to having my data in a BitVec
but I need it to be fast. I can play tricks with converting in and out of a Vec
in order to recover speed, for example the benchmark of
pub fn reduce3(mat: &mut Vec<BitVec<usize, Lsb0>>) {
let ncols = mat[0].len();
let first = std::mem::take(&mut mat[0]).into_vec();
for row in &mut mat[1..] {
if row[0] {
let mut vrow = std::mem::take(row).into_vec();
for (x, y) in vrow.iter_mut().zip(first.iter()) {
*x ^= y;
}
*row = BitVec::<usize, Lsb0>::from_vec(vrow);
row.truncate(ncols);
}
}
mat[0] = BitVec::<usize, Lsb0>::from_vec(first);
mat[0].truncate(ncols);
}
is 29.236 µs
, statistically equivalent to the first. It's what I'm currently using, but it's a little ugly. Is there a better way to get fast row operations on BitVec
's?
The bit-wise operators are a nightmare to implement, and in the absence of specialization I am forced to choose between either "semantically useful, but horribly slow" or "fast, but rapidly becomes a compiler error".
You could bypass the BitVec -> Vec -> BitVec
conversions by using the .domain_mut()
/.domain()
views to do essentially the algorithm you have here. As long as your matrix is not jagged (and each row is > 64 bits), you can use
mat.iter_mut().for_each(|bv| bv.force_align());
let (first, rest) = mat.split_at_mut(1);
let (_, f_elts, f_tail) = first.domain().region().unwrap();
// snip
let (_, r_elts, r_tail) = row.domain_mut().region().unwrap();
Here, the *_elts
names are &/mut [usize]
and the *_tail
names are guarded references to usize
. You can walk the slices just like you are in the loop, but you'd need to write something like
r_tail.store_value(r_tail.load_value() ^ f_tail.load_value());
to deal with the final partially-used word.
This is ... not convenient, but it avoids the type conversion and re-truncation you are doing, and should cause the same codegen you already have. I don't really have a good answer here; I'm always looking to improve performance here as much as I can. This example should be causing the specialization hack I implemented (dispatched here, resolved here) to kick in, so I am not sure offhand what could be causing the slowdown. I would guess that your bit-vectors aren't misaligned, so the usize
load/store behavior in the specialized function shouldn't be slow either...
Honestly a lot of my performance problems come from the fact that BitVec
doesn't require that bit [0]
be placed at the actual zero-index of the storage region. I will look into whether enforcing that as a requirement can be used to cause speedups here, since BitVec
and BitBox
would thus be always known to have better memory access properties.
Thank you for filing this and I apologize for the slow response. Please let me know if the Domain
views are at all helpful to you, and I'll work on improving codegen in the specialized case.