Slow running on ARM
ditannan opened this issue · 2 comments
I installed fast-ctc-decode in an ARM environment with:
$ git clone https://github.com/nanoporetech/fast-ctc-decode.git
$ cd fast-ctc-decode
$ pip install --user maturin
$ make test
The Rust version is rustc 1.46.0-nightly (0ca7f74db 2020-06-29) and the maturin version is maturin 0.8.2.
I ran the following code in both the ARM and x86 environments. The results show that beam search and Viterbi search are much slower on ARM than on x86: more than 40 times slower for beam search and more than 60 times slower for Viterbi search.
from time import time
import numpy as np
from fast_ctc_decode import beam_search, viterbi_search
l = 1000
alphabet = "NACGT"
loop = 1000
posteriors = np.random.rand(l, len(alphabet)).astype(np.float32)
start_time = time()
for lp in range(loop):
    seq, path = beam_search(posteriors, alphabet, beam_size=5, beam_cut_threshold=0.1)
end_time = time()
cost = end_time - start_time
print(f"beam search length of {l} for {loop} times cost: {cost}s, mean: {cost/loop}")
start_time = time()
for lp in range(loop):
    seq, path = viterbi_search(posteriors, alphabet)
end_time = time()
cost = end_time - start_time
print(f"viterbi search length of {l} for {loop} times cost: {cost}s, mean: {cost/loop}")
The results:
Arm:
beam search length of 1000 for 1000 times cost: 63.2727427482605s, mean: 0.0632727427482605
viterbi search length of 1000 for 1000 times cost: 2.8806848526000977s, mean: 0.002880684852600098
X86:
beam search length of 1000 for 1000 times cost: 1.4413361549377441s, mean: 0.0014413361549377442
viterbi search length of 1000 for 1000 times cost: 0.04563140869140625s, mean: 4.563140869140625e-05
So why is it so slow on ARM? How can I optimize the speed on ARM? And do you have a precompiled ARM package? Thanks a lot.
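For anyone repeating this comparison, the timing loop in the script can be wrapped in a small helper so both platforms are measured the same way. This is only a sketch: `bench` and its `loops` and `repeats` parameters are hypothetical names introduced here, and the stand-in workload is used so the snippet runs without fast_ctc_decode installed.

```python
from time import perf_counter


def bench(fn, *args, loops=1000, repeats=3, **kwargs):
    """Return the best mean seconds-per-call over `repeats` runs of `loops` calls.

    `bench` is a hypothetical helper for illustration, not part of fast_ctc_decode.
    """
    best = float("inf")
    for _ in range(repeats):
        start = perf_counter()  # monotonic, high-resolution clock
        for _ in range(loops):
            fn(*args, **kwargs)
        best = min(best, (perf_counter() - start) / loops)
    return best


if __name__ == "__main__":
    # Stand-in workload so the sketch runs on its own.
    mean = bench(sum, range(1000), loops=100)
    print(f"mean per call: {mean:.3e}s")
    # With the package installed, the two calls from the script would be:
    #   bench(beam_search, posteriors, alphabet, beam_size=5, beam_cut_threshold=0.1)
    #   bench(viterbi_search, posteriors, alphabet)
```

Taking the best of several repeats reduces the influence of one-off system noise, but it does not change the overall picture reported in this issue.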
I haven’t benchmarked on ARM myself, but the differences you are reporting are much larger than I would expect. Is it possible that you are using a debug build on ARM and a release build on x86? make test will result in a debug build. Can you rebuild and benchmark again after running make clean && make build?
Thank you very much. I rebuilt the package with make clean && make build and got a wheel file named target/wheels/fast_ctc_decode-0.2.5-cp37-cp37m-manylinux1_aarch64.whl. Running pip3 install target/wheels/fast_ctc_decode-0.2.5-cp37-cp37m-manylinux1_aarch64.whl failed with the error "fast_ctc_decode-0.2.5-cp37-cp37m-manylinux1_aarch64.whl is not a supported wheel on this platform." After renaming the file to fast_ctc_decode-0.2.5-cp37-cp37m-linux_aarch64.whl, the package installed successfully. The new benchmark on ARM:
beam search length of 1000 for 1000 times cost: 1.6493666172027588s, mean: 0.0016493666172027588
viterbi search length of 1000 for 1000 times cost: 0.06065034866333008s, mean: 6.065034866333008e-05
The time differences are now small, thanks a lot.
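A note on the wheel rename above: the last dash-separated field of a wheel filename is its platform tag, and pip refuses wheels whose tag is not in the interpreter's list of compatible tags. The manylinux1 tag is defined only for x86_64 and i686, which is the likely reason manylinux1_aarch64 was rejected while linux_aarch64 was accepted. The sketch below extracts that field and prints it next to a normalized local platform string. Both function names are hypothetical, and the comparison is a simplification of what pip actually checks.

```python
import sysconfig


def wheel_platform_tag(filename: str) -> str:
    """Return the platform tag, i.e. the last dash-separated field of a wheel filename."""
    stem = filename[: -len(".whl")] if filename.endswith(".whl") else filename
    return stem.split("-")[-1]


def local_platform_tag() -> str:
    """Return sysconfig's platform string normalized the way wheel tags are written."""
    return sysconfig.get_platform().replace("-", "_").replace(".", "_")


if __name__ == "__main__":
    built = "fast_ctc_decode-0.2.5-cp37-cp37m-manylinux1_aarch64.whl"
    renamed = "fast_ctc_decode-0.2.5-cp37-cp37m-linux_aarch64.whl"
    print("built wheel tag:  ", wheel_platform_tag(built))
    print("renamed wheel tag:", wheel_platform_tag(renamed))
    print("local platform:   ", local_platform_tag())
```

On the machine from this issue, the local platform would be expected to match the tag of the renamed wheel rather than the one maturin produced.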