[Feature Request] Provide support for PyHMMER

Question

Closed this issue 18 days ago · 9 comments

The current implementation uses HMMER, but PyHMMER is reported to be substantially faster when multithreaded. Are there any plans to support PyHMMER?

https://academic.oup.com/bioinformatics/article/39/5/btad214/7131068

Answer 1 · 2024-09-29T07:37:51.000Z

If it helps, I wrote a CLI wrapper for PyHMMER called PyHMMSearch (https://github.com/jolespin/pyhmmsearch), which I've been using in place of hmmsearch to substantially improve performance in multithreaded mode.

Answer 2 · 2024-09-29T15:29:35.000Z

Hi @jolespin
Thank you for the wrapper. I am curious about the runtime difference between HMMER and pyhmmsearch when analyzing a single protein sequence against a single HMM file. Would multithreaded mode help in this scenario?

Answer 3 · 2024-09-29T15:31:32.000Z

Multithreaded should be much quicker. I'll benchmark and send over the results to confirm (either today or tomorrow).

If you attach a sequence and HMM I'll test it out. If not, I'll just use a random set I already have downloaded.

Answer 4 · 2024-09-29T15:37:05.000Z

Thanks. If it works, I am happy to replace HMMER with pyhmmsearch in compleasm.

Answer 5 · 2024-09-30T21:22:15.000Z

Here's my benchmarking command:

for N_THREADS in 1 2 4 8 12; 
do 
    time (pyhmmsearch.py -i test_1n.faa -m gathering -d PF00709.21.hmm -o test_1n/n-${N_THREADS}.tsv -p=${N_THREADS}) 2> test_1n/logs/n-${N_THREADS}.e > test_1n/logs/n-${N_THREADS}.o
done

All thread counts produced identical output, as expected:

(align_env) Joshs-MBP:test jolespin$ md5 test_1n/*.tsv
MD5 (test_1n/n-1.tsv) = af45b851b478d893758a2c3a16826496
MD5 (test_1n/n-12.tsv) = af45b851b478d893758a2c3a16826496
MD5 (test_1n/n-2.tsv) = af45b851b478d893758a2c3a16826496
MD5 (test_1n/n-4.tsv) = af45b851b478d893758a2c3a16826496
MD5 (test_1n/n-8.tsv) = af45b851b478d893758a2c3a16826496
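
The same identical-output check can be scripted instead of compared by eye. This is only a minimal sketch using the standard library; the helper names `md5_of` and `all_identical` are hypothetical and not part of PyHMMSearch.

```python
import hashlib
from pathlib import Path


def md5_of(path):
    """Return the hex MD5 digest of a file's bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def all_identical(paths):
    """True when every file in `paths` has the same MD5 digest."""
    digests = {md5_of(p) for p in paths}
    return len(digests) <= 1
```

Assuming the output layout from the loop above, `all_identical(sorted(Path("test_1n").glob("n-*.tsv")))` would cover all five runs.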

The speed-up is noticeable going from 1 to 2 threads, then levels off (assuming real is the right field to read from the time output):

test_1n/logs/n-1.e:real	0m0.183s
test_1n/logs/n-2.e:real	0m0.079s
test_1n/logs/n-4.e:real	0m0.079s
test_1n/logs/n-8.e:real	0m0.078s
test_1n/logs/n-12.e:real	0m0.077s
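
To put numbers on those timings, here is a small sketch that computes the speed-up relative to one thread and the parallel efficiency (speed-up divided by thread count). The values are copied from the log lines above; the function name `speedup_table` is made up for illustration.

```python
# Wall-clock ("real") times in seconds, copied from the log lines above.
real_seconds = {1: 0.183, 2: 0.079, 4: 0.079, 8: 0.078, 12: 0.077}


def speedup_table(times):
    """Map thread count -> (speed-up vs. 1 thread, parallel efficiency)."""
    baseline = times[1]
    return {
        n: (baseline / t, baseline / t / n)
        for n, t in sorted(times.items())
    }


for n, (speedup, efficiency) in speedup_table(real_seconds).items():
    print(f"{n:>2} threads: {speedup:.2f}x speed-up, {efficiency:.0%} efficiency")
```

For this single-sequence, single-HMM test the speed-up stays between roughly 2.3x and 2.4x from 2 threads onward, so efficiency drops as threads are added. Runs this short are also likely to be dominated by start-up time, so the figures are indicative at best.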

Here's the sequence I used:

(align_env) Joshs-MBP:test jolespin$ cat test_n1.faa
>SRR13615825__k127_221120_1 # 1 # 987 # 1 # ID=13561_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.698
ILPYHRDLDLLSEARRGERKIGTTSRGIGPAYEDKIGRRGVRAGDLADPAALEEEIRENV
QARNRLVGDAAMDWRAIVDRLRSYGERMQPWIGDASLFLWDAIKAGRPVLFEGAQGTLLD
IDHGTYPYVTSSNSTIGGVCTGLGVGLHAVGGVIGVAKAYTTRVGGGPLPTELSGPLAER
LRESGQEYGASTGRPRRCGWYDAVAVRYAVRTNGLDGLALTKLDVLDGLETIDVCTAYRC
GGRTLTEFPADVRQLEACEPVYDRLPGWSRPTRGTRVFSDLPPDAQAYVRHLEQVSGVPV
TILSTGSDREDTIIREHSVIADWFALRP*

PF00709.21 HMM is attached:
PF00709.21.hmm.zip

Answer 6 · 2024-10-01T12:52:52.000Z

Compleasm already runs hmmer in multiple threads. I don't think pyhmmer will help.

Answer 7 · 2024-10-01T18:02:51.000Z

@lh3 One of the benefits of PyHMMER is that its speed scales with the number of threads (see Panel C of the figure in the manuscript linked above). In my own tests I also saw considerable performance gains when multithreaded:

Database  Tool             Single-threaded  12 threads
Pfam      PyHMMSearch      2:24             0:20
Pfam      HMMER HMMSearch  2:53             2:27

* Times are in minutes:seconds for the 4977 proteins in test/test.faa.gz.
Test files are here: https://github.com/jolespin/pyhmmsearch/tree/main/test
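
As a rough reading of the table, the sketch below converts the M:SS values to seconds and computes how much each tool gains from 12 threads. The numbers are taken from the table above; `to_seconds` is a hypothetical helper.

```python
def to_seconds(stamp):
    """Convert an 'M:SS' string such as '2:24' to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


# (single-threaded, 12 threads) times from the table above.
timings = {
    "PyHMMSearch": ("2:24", "0:20"),
    "HMMER HMMSearch": ("2:53", "2:27"),
}

for tool, (single, multi) in timings.items():
    ratio = to_seconds(single) / to_seconds(multi)
    print(f"{tool}: {ratio:.1f}x faster with 12 threads")
```

That works out to about 7.2x for PyHMMSearch versus about 1.2x for HMMER HMMSearch on this test set.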

PyHMMSearch is a CLI wrapper around PyHMMER.

I'd say PyHMMER is at least worth a shot to look into for performance gains.

Answer 8 · 2024-10-01T22:16:04.000Z

Compleasm is launching many independent hmmer runs in the single-threaded mode and schedules these runs to achieve overall multi-threading. Your pyhmmer won't help in this case. It also has one more dependency.
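
For readers unfamiliar with that pattern, scheduling many independent single-threaded runs might look roughly like the sketch below. This is not compleasm's actual code: the function names, the file paths, and the choice of `--domtblout` output are assumptions made for illustration, and the `runner` parameter exists only so the scheduling can be tried without HMMER installed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def hmmsearch_command(hmm_path, fasta_path, out_path):
    """Build a single-threaded hmmsearch call (paths are placeholders)."""
    return ["hmmsearch", "--cpu", "1", "--domtblout", out_path, hmm_path, fasta_path]


def run_all(jobs, workers, runner=subprocess.run):
    """Run independent commands concurrently, `workers` at a time.

    Each job is one external process, so threads are enough here: they
    only wait on the child processes. Results keep the order of `jobs`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, jobs))
```

A call such as `run_all([hmmsearch_command(h, "proteins.faa", h + ".domtbl") for h in hmm_files], workers=8)` would keep up to eight searches running at once, where `hmm_files` is a hypothetical list of HMM paths. With this layout the overall parallelism comes from the number of concurrent processes, not from threading inside any single search.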

Answer 9 · 2024-10-01T22:37:11.000Z

> Compleasm is launching many independent hmmer runs in the single-threaded mode and schedules these runs to achieve overall multi-threading

Makes sense, since compleasm already works around HMMER's multithreading limitations by splitting the work into separate runs manually. There might be some overhead from loading the HMMs this way, though.

> Your pyhmmer won't help in this case.

I didn't develop PyHMMER (https://github.com/althonos/pyhmmer); I'm just a big fan of the performance gains I saw once I started using it (which is why I made the CLI wrapper around it).

> It also has one more dependency.

Another aspect of PyHMMER I like is the lack of dependencies (HMMER doesn't even need to be installed), and it's a standalone Python executable, so it's very easy to install and use in existing environments.

Just wanted to put PyHMMER on the radar in case it could speed up the workflow since I've found it useful in my own workflows.