/IGGsearch

Metagenomic species profiling with enhanced coverage of the human gut microbiome

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

IGGsearch

This repository allows you to perform metagenomic species profiling described by http://dx.doi.org/10.1038/s41586-019-1058-x. The pipeline is being actively improved and may change in the future.

Database information

Species-specific marker genes were identified based on two main criterea: i) conserved across genomes within a species, and ii) rarely occurring in genomes from different species. Marker genes for gut species were additionally refined based on co-variation in abundance across metagenomes.

Marker genes were identified from 209,320 genomes from the IGGdb, corresponding to 23,790 species based on 95% genome-wide average nucleotide identity. This genome set includes a large number of genomes from the Human Gut MAG dataset (N=60,664) as well as 148,656 reference genomes and MAGs from PATRIC and IMG.

Quick start

Install dependencies:

Note: tested versions are indicated in parenthesis, but other versions might also work

Python dependencies can be installed using pip: pip install numpy pysam

Bowtie2 and samtools should be present on your path:
bowtie2 --help
samtools --help

Clone the repo from github: git clone https://github.com/snayfach/IGGsearch

Update your environment:
export PYTHONPATH=$PYTHONPATH:/path/to/IGGsearch/iggsearch
export PATH=$PATH:/path/to/IGGsearch

Note: replace /path/to with the correct file path on your system

Download and install the reference database:

Download: run_iggsearch.py download --gut-only
Unpack: tar -zxvf iggdb_v1.0.0_gut.tar.gz
Install:export IGG_DB=/path/to/iggdb_v1.0.0_gut

Note: use run_iggsearch.py download to download full database including non-gut species

Test the code using the dummy dataset:
run_iggsearch.py search --m1 test.fastq.gz --outdir test --test

Program options

Currently, there are four main modules, which can be viewed using: run_iggsearch.py -h

Description: Metagenomic species profiling with enhanced coverage of the human gut microbiome

Usage: iggsearch.py <command> [options]

Commands:
   download download reference database of marker genes
     search estimate species abundance from a single metagenome
      merge generate matrix files from multiple runs
   reformat change the file format of from a single run

Note: use iggsearch.py <command> -h to view usage for a specific command

Options for the main module can be viewed using: run_iggsearch.py search -h

IGGsearch: estimate species abundance from a single metagenome

optional arguments:
  -h, --help            show this help message and exit

input/output:
  --outdir PATH         Directory to store results.
                        Name should correspond to unique identifier for your sample
  --m1 PATH             FASTA/FASTQ file containing 1st mate if using paired-end reads.
                        Otherwise FASTA/FASTQ containing unpaired reads.
                        Can be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2)
                        Use comma ',' to separate multiple input files (ex: -1 file1.fq,file2.fq)
  --m2 PATH             FASTA/FASTQ file containing 2nd mate if using paired-end reads.
  --db_dir PATH         Path to reference database. By default, the IGG_DB environmental variable is used

pipeline speed/throughput:
  --max-reads INT       Number of reads to use from input file(s) (use all)
  --threads INT         Number of threads to use for read-alignment (1)
  --no-align            Skip read alignment if 'mapped_reads.bam' already exists (False)
                        Useful for rerunning pipeline with different options
  --test                Perform a quick testing run (False)

read alignment/quality control:
  --mapid FLOAT         Minimum DNA alignment identity between read and marker gene database (95.0)
  --aln_cov FLOAT       Minimum fraction of read covered by alignment (0.75)
  --readq FLOAT         Minimum average base quality score of reads (20.0)
  --mapq FLOAT          Minimum mapping quality of reads (30.0)

species reporting:
  --min-reads-gene INT  Minimum # of reads for detecting marker genes (2)
  --min-perc-genes INT  Minimum % of marker genes detected to report species (40)
  --min-sp-quality INT  Minimum quality score to report species (50)
                        where quality score = completeness - (5 x contamination) of best genome
  --all-species         Presets: --min-reads-gene=0 --min-perc-genes=0 --min-sp-quality=0
  --very-lenient        Presets: --min-reads-gene=1 --min-perc-genes=1 --min-sp-quality=0
  --lenient             Presets: --min-reads-gene=1 --min-perc-genes=15 --min-sp-quality=25
  --strict              Presets: --min-reads-gene=2 --min-perc-genes=40 --min-sp-quality=50 (default)
  --very-strict         Presets: --min-reads-gene=5 --min-perc-genes=60 --min-sp-quality=75