Faer-rs Backend
Hi,
Are you aware of the faer-rs crate? It has very strong performance, on par with OpenBLAS and Intel MKL, without dealing with C/C++ bindings.
It should provide a great performance boost for the appropriate fundamental operations, e.g. complex GEMM (conjugated or not).
One could start by including it gradually as an optional backend for some ops.
Thanks for the interest. I actually don't use raw matrix multiplication for the bulk of the backend calculations. That's because most linalg libraries don't know how to exploit the tensor product structure of quantum circuits. By knowing how the matrices are constructed, as a tensor product of smaller matrices, we never need to construct the full matrix and can skip entries we know will be 0. This is faster even than sparse representations.
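To make that concrete, here's a minimal sketch (not the actual qip code) of how a single-qubit gate can be applied directly to the state vector without building the full operator; the indexing convention (qubit 0 = least significant bit) is an assumption for illustration:

```rust
use num_complex::Complex64;

/// Apply a 2x2 gate `u` to qubit `k` of an n-qubit state vector, in place,
/// without ever building the full 2^n x 2^n matrix. Only pairs of
/// amplitudes whose indices differ in bit `k` are touched.
/// (Convention assumed here: qubit 0 is the least significant bit.)
fn apply_single_qubit_gate(state: &mut [Complex64], u: &[[Complex64; 2]; 2], k: usize) {
    let stride = 1 << k; // distance between paired amplitudes
    let mut base = 0;
    while base < state.len() {
        for i in base..base + stride {
            let a = state[i];
            let b = state[i + stride];
            state[i] = u[0][0] * a + u[0][1] * b;
            state[i + stride] = u[1][0] * a + u[1][1] * b;
        }
        base += 2 * stride;
    }
}
```

The work is O(2^n) per gate instead of the O(4^n) a dense matrix-vector product would cost, and no 2^n x 2^n matrix is ever allocated.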
I should set up some benchmarks to explicitly compare this though.
Yeah, I haven't checked the codebase in much detail. Just tossing the idea around in case some of the ops can be formulated in a way that's relevant to faer-rs.
Added some benchmarks across a few backends.
Here I'm just testing a single 2x2 matrix applied to qubit 0 with the identity applied elsewhere -- most benchmarks are on 12 qubits; the "large" ones are on 15 qubits.
I test 4 backends: qip (this library), ndarray, sprs (sparse matrices), and faer.
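For reference, the dense backends have to materialize (or reuse) the full operator before the multiply. This is a hypothetical sketch of roughly what the ndarray "build_each" variant amounts to, not the actual benchmark code; the Kronecker ordering convention is an illustration assumption:

```rust
use ndarray::{linalg::kron, Array1, Array2};
use num_complex::Complex64;

/// Build the full 2^n x 2^n operator (gate on qubit 0, identity elsewhere)
/// via Kronecker products, then do a dense matrix-vector multiply.
/// Rebuilding this on every call is the cost the tensor-product-aware
/// backend avoids entirely.
fn apply_dense(gate: &Array2<Complex64>, state: &Array1<Complex64>, n: usize) -> Array1<Complex64> {
    let eye = Array2::<Complex64>::eye(2);
    let mut full = gate.clone();
    for _ in 1..n {
        // I (x) ... (x) I (x) gate: gate acts on the least significant qubit.
        full = kron(&eye, &full);
    }
    full.dot(state)
}
```

The "reuse" variants build the full operator once up front and only pay for the matrix-vector product on each iteration.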
test tests::bench_ones_faer_reuse_arena ... bench: 6,446,267 ns/iter (+/- 47,841)
test tests::bench_ones_ndarray_build_each ... bench: 142,738,247 ns/iter (+/- 980,752)
test tests::bench_ones_ndarray_reuse ... bench: 4,600,275 ns/iter (+/- 404,325)
test tests::bench_ones_ndarray_reuse_arena ... bench: 4,604,027 ns/iter (+/- 562,828)
test tests::bench_ones_qip ... bench: 58,430 ns/iter (+/- 1,756)
test tests::bench_ones_qip_large ... bench: 1,289,297 ns/iter (+/- 23,239)
test tests::bench_ones_sprs_build_each ... bench: 103,473 ns/iter (+/- 637)
test tests::bench_ones_sprs_build_each_large ... bench: 38,890,543 ns/iter (+/- 616,850)
test tests::bench_ones_sprs_reuse ... bench: 17,702 ns/iter (+/- 205)
test tests::bench_ones_sprs_reuse_large ... bench: 5,318,135 ns/iter (+/- 47,856)
It looks like faer isn't performing very well compared to the others, so I'm worried I didn't write the benchmark properly. If you're familiar with the library, I'm open to changes.
Followup: after talking with the faer devs on their Discord, it seems it isn't really the best choice for a backend. Their matrix-vector multiplication support isn't prioritized and performs no better than single-threaded BLAS. I'll close this for now but keep the bench around.