nanoporetech/fast-ctc-decode

Slow running on ARM

ditannan opened this issue · 2 comments

I installed fast-ctc-decode in an ARM environment with:
$ git clone https://github.com/nanoporetech/fast-ctc-decode.git
$ cd fast-ctc-decode
$ pip install --user maturin
$ make test

Rust version: rustc 1.46.0-nightly (0ca7f74db 2020-06-29);
maturin version: 0.8.2
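
As a quick sanity check that the build is usable before benchmarking, something along the lines of this minimal sketch (random float32 posteriors of shape (timesteps, len(alphabet))) exercises both decoders:

import numpy as np
from fast_ctc_decode import beam_search, viterbi_search

# tiny random posterior matrix: (timesteps, alphabet size), float32
alphabet = "NACGT"
posteriors = np.random.rand(100, len(alphabet)).astype(np.float32)

seq, path = viterbi_search(posteriors, alphabet)
print("viterbi:", seq[:20])

seq, path = beam_search(posteriors, alphabet, beam_size=5, beam_cut_threshold=0.1)
print("beam:", seq[:20])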

I ran the following code in both the ARM and x86 environments. The results show that beam search and Viterbi search on ARM are much slower than on x86: more than 40 times slower for beam search and 60 times slower for Viterbi search.

from time import time

import numpy as np
from fast_ctc_decode import beam_search, viterbi_search

# decode length, alphabet, and number of benchmark repetitions
l = 1000
alphabet = "NACGT"
loop = 1000

# random posterior probabilities, float32 as the decoders expect
posteriors = np.random.rand(l, len(alphabet)).astype(np.float32)

start_time = time()
for lp in range(loop):
    seq, path = beam_search(posteriors, alphabet, beam_size=5, beam_cut_threshold=0.1)
end_time = time()
cost = end_time - start_time

print(f"beam search length of {l} for  {loop} times cost: {cost}s, mean: {cost/loop}")

start_time = time()
for lp in range(loop):
    seq, path = viterbi_search(posteriors, alphabet)
end_time = time()
cost = end_time - start_time

print(f"viterbi search length of {l} for  {loop} times cost: {cost}s, mean: {cost/loop}")

The results:
ARM:
beam search length of 1000 for 1000 times cost: 63.2727427482605s, mean: 0.0632727427482605
viterbi search length of 1000 for 1000 times cost: 2.8806848526000977s, mean: 0.002880684852600098

x86:
beam search length of 1000 for 1000 times cost: 1.4413361549377441s, mean: 0.0014413361549377442
viterbi search length of 1000 for 1000 times cost: 0.04563140869140625s, mean: 4.563140869140625e-05

So why is it so slow on ARM? How can the speed on ARM be improved? And do you provide a pre-built ARM package? Thanks a lot.

I haven't benchmarked on ARM myself, but the differences you are reporting are much larger than I would expect. Is it possible that you are using a debug build on ARM and a release build on x86? make test results in a debug build; can you rebuild and benchmark again after doing make clean && make build?
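
If in doubt, a quick re-timing after the rebuild, along the lines of this sketch (same random-posteriors setup as the benchmark above), should make a debug build obvious as a large slowdown:

import timeit

import numpy as np
from fast_ctc_decode import viterbi_search

alphabet = "NACGT"
posteriors = np.random.rand(1000, len(alphabet)).astype(np.float32)

# time a batch of decodes and compare against the numbers measured above
t = timeit.timeit(lambda: viterbi_search(posteriors, alphabet), number=100)
print(f"viterbi_search mean: {t / 100:.6f}s")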

Thank you very much. I rebuilt the package with make clean && make build and got a wheel file named target/wheels/fast_ctc_decode-0.2.5-cp37-cp37m-manylinux1_aarch64.whl. Running pip3 install target/wheels/fast_ctc_decode-0.2.5-cp37-cp37m-manylinux1_aarch64.whl failed with "fast_ctc_decode-0.2.5-cp37-cp37m-manylinux1_aarch64.whl is not a supported wheel on this platform." After renaming the file to fast_ctc_decode-0.2.5-cp37-cp37m-linux_aarch64.whl, I installed the package successfully (see the wheel-tag sketch after the results below for one way to check which tags the platform accepts). The new benchmark on ARM:

beam search length of 1000 for 1000 times cost: 1.6493666172027588s, mean: 0.0016493666172027588
viterbi search length of 1000 for 1000 times cost: 0.06065034866333008s, mean: 6.065034866333008e-05
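
For reference, one way to see why the wheel had to be renamed (a sketch, assuming the third-party packaging library is installed and run under the same Python 3.7 interpreter used for pip3 install): list the tags the interpreter accepts and check the two platform tags involved. manylinux1 is only defined for x86_64 and i686, so a manylinux1_aarch64 wheel is rejected, while the plain linux_aarch64 tag is accepted.

from packaging.tags import sys_tags

# wheel tags this interpreter will accept (interpreter-abi-platform triples)
accepted = {str(t) for t in sys_tags()}
for tag in ("cp37-cp37m-manylinux1_aarch64", "cp37-cp37m-linux_aarch64"):
    print(tag, "->", "supported" if tag in accepted else "not supported")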

The time differences are now small. Thanks a lot.