huangnengCSU/compleasm

[Feature Request] Provide support for PyHMMER

Closed this issue · 9 comments

The current implementation uses HMMER, but PyHMMER is much faster when multi-threaded. Are there any plans to support PyHMMER?

https://academic.oup.com/bioinformatics/article/39/5/btad214/7131068

[Image: figure from the PyHMMER manuscript linked above]

If it's any help, I wrote a cli wrapper for PyHMMER called PyHMMSearch (https://github.com/jolespin/pyhmmsearch) which I've been using in place of hmmsearch to dramatically increase performance in multithreaded modes.

Hi @jolespin
Thank you for the wrapper. How do the runtimes of HMMER and pyhmmsearch compare when analyzing a single protein sequence against a single HMM file? Would multi-threaded mode help in that scenario?

Multithreaded should be much quicker. I'll benchmark and send over the results to confirm (either today or tomorrow).

If you attach a sequence and HMM I'll test it out. If not, I'll just use a random set I already have downloaded.

Thanks. If it works, I am happy to replace HMMER with pyhmmsearch in compleasm.

Here's my benchmarking command:

for N_THREADS in 1 2 4 8 12; 
do 
    time (pyhmmsearch.py -i test_1n.faa -m gathering -d PF00709.21.hmm -o test_1n/n-${N_THREADS}.tsv -p=${N_THREADS}) 2> test_1n/logs/n-${N_THREADS}.e > test_1n/logs/n-${N_THREADS}.o
done

All runs produced the same results, as expected:

(align_env) Joshs-MBP:test jolespin$ md5 test_1n/*.tsv
MD5 (test_1n/n-1.tsv) = af45b851b478d893758a2c3a16826496
MD5 (test_1n/n-12.tsv) = af45b851b478d893758a2c3a16826496
MD5 (test_1n/n-2.tsv) = af45b851b478d893758a2c3a16826496
MD5 (test_1n/n-4.tsv) = af45b851b478d893758a2c3a16826496
MD5 (test_1n/n-8.tsv) = af45b851b478d893758a2c3a16826496

Looks like the speedup is noticeable (assuming I'm reading the output of time correctly and that real is the right value to look at):

test_1n/logs/n-1.e:real	0m0.183s
test_1n/logs/n-2.e:real	0m0.079s
test_1n/logs/n-4.e:real	0m0.079s
test_1n/logs/n-8.e:real	0m0.078s
test_1n/logs/n-12.e:real	0m0.077s
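To put a number on the speedup, the real times above can be converted to seconds and divided into the single-thread time. The following is a minimal Python sketch; the helper names (`parse_real`, `speedups`) are made up for illustration, and with a single sub-second run per thread count the ratios should be treated as rough.

```python
import re

# `real` values copied from the logs above, keyed by thread count.
REAL_TIMES = {
    1: "0m0.183s",
    2: "0m0.079s",
    4: "0m0.079s",
    8: "0m0.078s",
    12: "0m0.077s",
}


def parse_real(value: str) -> float:
    """Convert a shell `time` value such as '0m0.183s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", value)
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def speedups(real_times: dict) -> dict:
    """Return the single-thread time divided by each run's time."""
    baseline = parse_real(real_times[1])
    return {n: baseline / parse_real(t) for n, t in real_times.items()}


if __name__ == "__main__":
    for n_threads, ratio in speedups(REAL_TIMES).items():
        print(f"{n_threads:>2} threads: {ratio:.2f}x")
```

On these numbers the ratio is about 2.3x at 2 threads and barely changes up to 12 threads.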

Here's the sequence I used:

(align_env) Joshs-MBP:test jolespin$ cat test_n1.faa
>SRR13615825__k127_221120_1 # 1 # 987 # 1 # ID=13561_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.698
ILPYHRDLDLLSEARRGERKIGTTSRGIGPAYEDKIGRRGVRAGDLADPAALEEEIRENV
QARNRLVGDAAMDWRAIVDRLRSYGERMQPWIGDASLFLWDAIKAGRPVLFEGAQGTLLD
IDHGTYPYVTSSNSTIGGVCTGLGVGLHAVGGVIGVAKAYTTRVGGGPLPTELSGPLAER
LRESGQEYGASTGRPRRCGWYDAVAVRYAVRTNGLDGLALTKLDVLDGLETIDVCTAYRC
GGRTLTEFPADVRQLEACEPVYDRLPGWSRPTRGTRVFSDLPPDAQAYVRHLEQVSGVPV
TILSTGSDREDTIIREHSVIADWFALRP*

PF00709.21 HMM is attached:
PF00709.21.hmm.zip

lh3 commented

Compleasm already runs hmmer in multiple threads. I don't think pyhmmer will help.

@lh3 One of the benefits of PyHMMER is that the speed scales with the threads (Panel C in figure above from the manuscript). I've done some tests myself and noticed considerable performance gains when multithreaded:

| Database | Tool | Single Threaded | 12 Threads |
|---|---|---|---|
| Pfam | PyHMMSearch | 2:24 | 0:20 |
| Pfam | HMMER HMMSearch | 2:53 | 2:27 |

* Times are in minutes:seconds for 4977 proteins in test/test.faa.gz.
Test files are here: https://github.com/jolespin/pyhmmsearch/tree/main/test
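As a quick check on how each tool scales, the table entries can be turned into a 1-thread versus 12-thread ratio. This sketch assumes the values are in minutes:seconds; the function names are hypothetical.

```python
# Wall-clock times from the table above, read as minutes:seconds (an assumption).
TIMES = {
    "PyHMMSearch": ("2:24", "0:20"),
    "HMMER HMMSearch": ("2:53", "2:27"),
}


def to_seconds(mm_ss: str) -> int:
    """Convert an 'M:SS' string to whole seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


def thread_speedup(single: str, multi: str) -> float:
    """Single-threaded time divided by the 12-thread time."""
    return to_seconds(single) / to_seconds(multi)


if __name__ == "__main__":
    for tool, (single, multi) in TIMES.items():
        print(f"{tool}: {thread_speedup(single, multi):.2f}x")
```

Read this way, PyHMMSearch goes from 144 s to 20 s (7.2x), while HMMER HMMSearch goes from 173 s to 147 s (about 1.2x).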

PyHMMSearch is a CLI wrapper around PyHMMER.

I'd say PyHMMER is at least worth a shot to look into for performance gains.

lh3 commented

Compleasm is launching many independent hmmer runs in the single-threaded mode and schedules these runs to achieve overall multi-threading. Your pyhmmer won't help in this case. It also has one more dependency.

> Compleasm is launching many independent hmmer runs in the single-threaded mode and schedules these runs to achieve overall multi-threading

Makes sense since compleasm is already overcoming the multi-threaded limitations of HMMER by splitting everything out manually. Though, there might be some overhead in loading the HMMs in this way.
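For anyone following along, the scheduling pattern described here (many independent single-threaded runs spread over a worker pool) could look roughly like the sketch below. This is not compleasm's actual code: the function names, the job tuple layout, and the choice of `--cpu 1` with `--domtblout` are assumptions made for illustration. The `runner` argument exists only so the scheduling logic can be exercised without HMMER installed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_hmmsearch_cmd(hmm_path, seq_path, out_path):
    """Build the command line for one hmmsearch run limited to one worker thread."""
    return [
        "hmmsearch",
        "--cpu", "1",             # one worker thread per run (assumed setting)
        "--domtblout", out_path,  # per-domain tabular output
        "-o", "/dev/null",        # discard the human-readable report
        hmm_path,
        seq_path,
    ]


def run_job(job, runner=subprocess.run):
    """Run a single (hmm_path, seq_path, out_path) job and return its output path."""
    hmm_path, seq_path, out_path = job
    runner(build_hmmsearch_cmd(hmm_path, seq_path, out_path), check=True)
    return out_path


def run_all(jobs, n_workers=4, runner=subprocess.run):
    """Spread independent runs over a thread pool.

    Threads are enough here because each worker just waits on a child process.
    Results come back in the same order as `jobs`.
    """
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(lambda job: run_job(job, runner), jobs))


# Example call (hypothetical file names, requires hmmsearch on PATH):
# run_all([("a.hmm", "a.faa", "a.tbl"), ("b.hmm", "b.faa", "b.tbl")], n_workers=2)
```

With this layout, overall parallelism comes from the pool size rather than from threading inside each search, so every run pays its own process startup and HMM loading cost.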

> Your pyhmmer won't help in this case.

I didn't develop PyHMMER (https://github.com/althonos/pyhmmer); I'm just a big fan of the performance gains I saw once I started using it, which is why I made the CLI wrapper around it.

> It also has one more dependency.

Another aspect of PyHMMER I like is the lack of dependencies (not even HMMER needs to be installed). It's a standalone Python executable, so it's very easy to install and use in existing environments.

Just wanted to put PyHMMER on the radar in case it could speed up the workflow since I've found it useful in my own workflows.