/pyhmmsearch

Fast implementation of HMMSEARCH optimized for high-memory systems using PyHmmer

Primary LanguagePythonMIT LicenseMIT

PyHMMSearch

Fast implementation of HMMSearch optimized for high-memory systems using PyHmmer. PyHMMSearch can handle fasta in uncompressed or gzip format and databases in either HMM or Python pickle serialized format. No intermediate files are created.

Installation:

pip install pyhmmsearch

Dependencies:

  • pyhmmer >=0.10.12
  • pandas
  • tqdm

Benchmarking:

Database Tool Single Threaded 12 Threads
Pfam PyHMMSearch 2:24 0:20
Pfam HMMER HMMSearch 2:53 2:27

* Time in minutes for 4977 proteins in test/test.faa.gz.

Official benchmarking for hmmsearch algorithm implemented in PyHMMER against HMMER from Larralde et al. 2023:

drawing

Usage:

Recommended usage for PyHMMSearch is on systems with 1) high RAM; 2) large numbers of threads; and/or 3) reading/writing to disk is charged (e.g., AWS EFS). Also useful when querying a large number of proteins.

  • Using the official Pfam database files:

    # Download database
    DATABASE_DIRECTORY=/path/to/database_directory/
    mkdir -p ${DATABASE_DIRECTORY}/Annotate/Pfam
    wget -v -P ${DATABASE_DIRECTORY}/Annotate/Pfam https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz 
    
    # Run PyHMMSearch
    pyhmmsearch -i test/test.faa.gz  -o output.tsv -b ${DATABASE_DIRECTORY}/Annotate/Pfam/Pfam-A.hmm.gz -p=-1
  • Build a serialized database:

    # Provide a database
    serialize_hmm_models -d path/to/Pfam-A.hmm.gz  -b path/to/database.pkl.gz
    
    # or a directory of HMMs
    serialize_hmm_models -d path/to/hmm_directory/  -b path/to/database.pkl.gz
    
    # or from a list of filepaths to HMM models
    serialize_hmm_models -l path/to/hmms.list  -b path/to/database.pkl.gz
    
    # or form a list through stdin
    ls path/to/directory/*.hmm |  serialize_hmm_models -b path/to/database.pkl.gz
  • Using the serialized database files:

    Database can be uncompressed pickle or gzipped pickle.

    pyhmmsearch -i test/test.faa.gz  -o output.tsv -b ~/Databases/Pfam/database.pkl.gz -p=-1
  • Using a custom threshold table (e.g., BUSCO Markers):

    pyhmmsearch -i test/test.faa.gz  -o output.tsv -d test/bacteria_odb10/bacteria_odb10.hmm.gz -s test/bacteria_odb10/scores_cutoff -f name -p=-1
  • Grouping hits by query protein:

    reformat_pyhmmsearch -i pyhmmsearch_output.tsv -o pyhmmsearch_output.reformatted.tsv

Options:

$ pyhmmsearch -h
usage: pyhmmsearch -i <proteins.fasta> -o <output.tsv> -d

    Running: pyhmmsearch v2024.4.25 via Python v3.10.14 | /Users/jolespin/miniconda3/envs/kofamscan_env/bin/python

options:
  -h, --help            show this help message and exit

I/O arguments:
  -i PROTEINS, --proteins PROTEINS
                        path/to/proteins.fasta. stdin does not stream and loads everything into memory. [Default: stdin]
  -o OUTPUT, --output OUTPUT
                        path/to/output.tsv [Default: stdout]
  --no_header           No header

Utility arguments:
  -p N_JOBS, --n_jobs N_JOBS
                        Number of threads to use [Default: 1]

HMMSearch arguments:
  -s SCORES_CUTOFF, --scores_cutoff SCORES_CUTOFF
                        path/to/scores_cutoff.tsv [id_hmm]<tab>[score_threshold], No header.
  -f {accession,name}, --hmm_marker_field {accession,name}
                        HMM reference type (accession, name) [Default: accession]
  -t SCORE_TYPE, --score_type SCORE_TYPE
                        {full, domain} [Default: full]
  -m {gathering,noise,e,trusted}, --threshold_method {gathering,noise,e,trusted}
                        Cutoff threshold method [Default:  e]
  -e EVALUE, --evalue EVALUE
                        E-value threshold [Default: 10.0]

Database arguments:
  -d HMM_DATABASE, --hmm_database HMM_DATABASE
                        path/to/database.hmm cannot be used with -b/-serialized_database
  -b SERIALIZED_DATABASE, --serialized_database SERIALIZED_DATABASE
                        path/to/database.pkl cannot be used with -d/--database_directory.  Database should be pickled dictionary {name:hmm}

Copyright 2024 Josh L. Espinoza (jolespin@newatlantis.io)

Outputs:

  • From pyhmmsearch:

    id_protein id_hmm threshold score bias best_domain-score best_domain-bias e-value
    SRR13615825__k127_453760_1 PF00389.34 (24.600000381469727, 24.600000381469727) 93.686 6.702 89.856 6.702 1.984e-27
    SRR13615825__k127_295655_1 PF00389.34 (24.600000381469727, 24.600000381469727) 83.195 0.005 83.167 0.005 3.456e-24
    SRR13615825__k127_218710_3 PF00389.34 (24.600000381469727, 24.600000381469727) 42.235 0.004 42.073 0.004 1.559e-11
    SRR13615825__k127_272080_1 PF00389.34 (24.600000381469727, 24.600000381469727) 24.673 0.000 22.067 0.000 4.154e-06
    SRR13615825__k127_297426_1 PF02826.23 (25.100000381469727, 25.100000381469727) 170.426 0.003 170.122 0.003 6.392e-51
  • From reformat_pyhmmsearch:

    id_protein number_of_hits ids evalues scores
    SRR13615825__k127_453760_1 3 ['PF00389.34', 'PF02826.23', 'PF03446.19'] [1.984e-27, 2.113e-39, 2.41e-08] [93.686, 132.902, 32.336]
    SRR13615825__k127_295655_1 2 ['PF00389.34', 'PF02826.23'] [3.456e-24, 7.794e-21] [83.195, 72.421]
    SRR13615825__k127_218710_3 1 ['PF00389.34'] [1.559e-11] [42.235]
    SRR13615825__k127_272080_1 2 ['PF00389.34', 'PF02826.23'] [4.154e-06, 2.035e-41] [24.673, 139.471]
    SRR13615825__k127_297426_1 1 ['PF02826.23'] [6.392e-51] [170.426]
  • From reformat_pyhmmsearch with -b/--best_hits_only:

    id_protein id evalue score
    SRR13615825__k127_453760_1 PF02826.23 2.113e-39 132.902
    SRR13615825__k127_295655_1 PF00389.34 3.456e-24 83.195
    SRR13615825__k127_218710_3 PF00389.34 1.559e-11 42.235
    SRR13615825__k127_272080_1 PF02826.23 2.035e-41 139.471
    SRR13615825__k127_297426_1 PF02826.23 6.392e-51 170.426

If you use this tool, please cite the following sources:

  • Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011 Oct;7(10):e1002195. doi: 10.1371/journal.pcbi.1002195. Epub 2011 Oct 20. PMID: 22039361; PMCID: PMC3197634.

  • Larralde M, Zeller G. PyHMMER: a Python library binding to HMMER for efficient sequence analysis. Bioinformatics. 2023 May 4;39(5):btad214. doi: 10.1093/bioinformatics/btad214. PMID: 37074928; PMCID: PMC10159651.

License:

The code for PyHMMSearch is licensed under an MIT License

Please contact jolespin@newatlantis.io regarding any licensing concerns.