/mof-search

BLAST-like search across all pre-2019 bacteria on ordinary laptop and desktop computers within a few hours

Primary LanguagePythonOtherNOASSERTION

MOF-Search

Pipeline for BLAST-like search across all pre-2019 bacteria from ENA on standard desktop and laptops computers. MOF-Search uses phylogenetically compressed assemblies and their k-mer indexes to align batches of queries to them by Minimap 2, all within only several hours.


Info Paper DOI Tests GitHub release

Contents

1. Introduction

The central idea behind MOF-Search, enabling alignment locally at such a large scale, is phylogenetic compression (paper) - a technique based on using estimated evolutionary history to guide compression and search of large genome collections using existing algorithms and data structures.

In short, input data are reorganized according to the topology of the estimated phylogenies, which makes data highly locally compressible even using basic techniques. Existing software packages for compression, indexing, and search

  • in this case XZ, COBS, and Minimap2 - are then used as low-level tools. The resulting performance gains come from a wide range of benefits of phylogenetic compression, including easy parallelization, small memory requirements, small database size, better memory locality, and better branch prediction.

For more information about phylogenetic compression and the implementation details of MOF-Search, see the corresponding paper (including its supplementary material and visit the associated website.

Citation

K. Břinda, L. Lima, S. Pignotti, N. Quinones-Olvera, K. Salikhov, R. Chikhi, G. Kucherov, Z. Iqbal, and M. Baym. Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression. bioRxiv 2023.04.15.536996, 2023. https://doi.org/10.1101/2023.04.15.536996

2. Requirements

2a) Hardware

MOF-Search requires a standard desktop or laptop computer with an *nix system, and it can also run on a cluster. The minimal hardware requirements are 12 GB RAM and approximately 120 GB of disk space (102 GB for the database and a margin for intermediate files).

2b) Dependencies

MOF-Search is implemented as a Snakemake pipeline, using the Conda system to manage non-standard dependencies. Ensure you have Conda installed with the following packages:

  • GNU Time (on Linux present by default; on OS X, install with brew install gnu-time).
  • Python (>=3.7)
  • Snakemake (>=6.2.0)
  • Mamba (>= 0.20.0) - optional, but recommended

Additionally, MOF-Search uses standard Unix tools like GNU Make, cURL, XZ Utils, and GNU Gzip. These tools are typically included in standard *nix installations. However, in minimal setups (e.g., virtualization, continuous integration), you might need to install them using the corresponding package managers.

3. Installation

3a) Step 1: Install dependencies

Make sure you have Conda and GNU Time installed. On Linux:

sudo apt-get install conda

On OS X (using Homebrew):

brew install conda
brew install gnu-time

Install Python (>=3.7), Snakemake (>=6.2.0), and Mamba (optional but recommended) using Conda:

conda install -y -c bioconda -c conda-forge \
    "python>=3.7" "snakemake>=6.2.0" "mamba>=0.20.0"

3b) Step 2: Clone the repository

Clone the MOF-Search repository from GitHub and navigate into the directory:

 git clone https://github.com/karel-brinda/mof-search
 cd mof-search

3c) Step 3: Run a simple test

Run the following command to ensure the pipeline works for sample queries and 3 batches (this will also install all additional dependencies using Conda):

make test

Make sure the test returns 0 (success) and that you see the expected output message:

 Success! Test run produced the expected output.

3d) Step 4: Download the database

Download all phylogenetically compressed assemblies and COBS k-mer indexes for the 661k-HQ collection by:

make download

The downloaded files will be located in the asms/ and cobs/ directories.

Notes:

  • The compressed assemblies comprise all the genomes from the 661k collection.The COBS indexes comprise only those genomes that passed quality control.

4. Usage

4a) Step 1: Copy or symlink your queries

Remove the default test files or your old files in the input/ directory and copy or symlink (recommended) your query files. The supported input formats are FASTA and FASTQ, possibly gzipped. All query files will be preprocessed and merged together.

Notes:

  • All query names have to be unique among all query files.
  • All non-ACGT characters in your query sequences will be translated to A.

4b) Step 2: Adjust configuration

Edit the config.yaml file for your desired search. All available options are documented directly there.

4c) Step 3: Clean up intermediate files

Run make clean to clean intermediate files from the previous runs. This includes COBS matching files, alignment files, and various reports.

4d) Step 4: Run the pipeline

Simply run make, which will execute Snakemake with the corresponding parameters. If you want to run the pipeline step by step, run make match followed by make map.

Optional replace make match with make label (running make label right after make match will also work). This will use the outputs from make match to generate the labels (at species taxonomic level) of the matches. The labels are saved here: intermediate/04_filter/labels, and there you will find

  • CSV: in each row the first element is the query-id, the other values correspond to the labels of the filtered intermediate/04_filter/{name}.fa, in the same order.
  • json: keys are the filenames of files in input/ (without extension), for each filename, a list with the id of its sequences.
  • txt: consensus labels for each input file (majority vote among all labels from its sequences)

4e) Step 5: Analyze your results

Check the output files in output/ (for more info about formats, see 5c) File formats).

If the results do not correspond to what you expected and you need to re-adjust your search parameters, go to Step 2. If only the mapping part is affected by the changes, you proceed more rapidly by manually removing the files in intermediate/05_map and output/ and running directly make map.

5. Additional information

5a) List of workflow commands

MOF-Search is executed via GNU Make, which handles all parameters and passes them to Snakemake.

Here's a list of all implemented commands (to be executed as make {command}):

####################
# General commands #
####################
   all                Run everything (the default rule)
   test               Quick test using 3 batches
   help               Print help messages
   clean              Clean intermediate search files
   cleanall           Clean all generated and downloaded files
##################
# Pipeline steps #
##################
   conda              Create the conda environments
   download           Download the assemblies and COBS indexes
   download_asms      Download only the assemblies
   download_cobs      Download only the COBS indexes
   match              Match queries using COBS (queries -> candidates)
   label              Label queries using labels from candidates (candidates -> labels) ** [new] **
   map                Map candidates to assemblies (candidates -> alignments)
#############
# Reporting #
#############
   config             Print configuration without comments
   report             Generate Snakemake report
########
# Misc #
########
   cluster_slurm      Submit to a SLURM cluster
   cluster_lsf_test   Submit the test pipeline to LSF cluster
   cluster_lsf        Submit to LSF cluster
   format             Reformat Python and Snakemake files

Note: make format requires YAPF and Snakefmt, which can be installed by conda install -c conda-forge -c bioconda yapf snakefmt.

5b) Directories

  • asms/, cobs/ Downloaded assemblies and COBS indexes
  • input/ Queries, to be provided within one or more FASTA/FASTQ files, possibly gzipped (.fa)
  • intermediate/ Intermediate files
    • 00_queries_preprocessed/ Preprocessed queries
    • 01_queries_merged/ Merged queries
    • 02_cobs_decompressed/ Decompressed COBS indexes (temporary, used only in the disk mode is used)
    • 03_match/ COBS matches
    • 04_filter/ Filtered candidates
      • labels/ Labels of the filtered candidates by query.
    • 05_map/ Minimap2 alignments
  • logs/ Logs and benchmarks
  • output/ The resulting files (in a headerless SAM format)

5c) File formats

Input files: FASTA or FASTQ files possibly compressed by gzipped. The files are searched in the input/ directory, as files with the following suffixes: .fa, .fasta, .fna, .fq, .fastq (possibly with .gz at the end).

Output files:

  • output/{name}.sam_summary.gz: output alignments in a headerless SAM format
  • output/{name}.sam_summary.stats: statistics about your computed alignments in TSV

SAM headers are omitted as all search experiments generate hits across large numbers of assemblies (many of them being spurious). As a result, SAM headers then dominate the outputs. Nevertheless, we note that, in principle, the SAM headers can always be recreated from the FASTA files in asms/, although this functionality is not currently implemented.

5d) Running on a cluster

Running on a cluster is much faster as the jobs produced by this pipeline are quite light and usually start running as soon as they are scheduled.

For LSF clusters:

  1. Test if the pipeline is working on a LSF cluster: make cluster_lsf_test;
  2. Configure you queries and run the full pipeline: make cluster_lsf;

5e) Known limitations

  • Swapping if the number of queries too high. If the number of queries is too high (e.g., 10M Illumina reads), the auxiliary Python scripts start to use too much memory, which may result in swapping. Try to keep the number of queries moderate and ideally their names short.
  • No support for ambiguous characters in queries. As the tools used internally by MOF-Search support only the nucleotide alphabet, all non-ACGT characters in queries are first converted to A.
  • Too many reported hits. When queries have too many equally good hits in the database, even if the threshold on the maximum number of hits is chosen low – for instance 10 – the program will take top 10 + ties, which can be a huge number (especially for short sequences).

6. License

MIT

7. Contacts