hitman: A Groovy repository from fangly

HiTMAn - High Throughput Microbial Analysis

Hitman is a pipeline for analysing 16S rRNA amplicon datasets. The default
values make it suitable for bacterial and archaeal profiling in 454 and Illumina
datasets.


INSTALLATION

Please install these dependencies first:
   perl
   Getopt::Euclid perl module
   IPC::System::Simple perl module
   bpipe 0.9.8.5
   pigz 2.1.6
   sff2fastq 0.8.0
   pear 0.9
   fastx_toolkit 0.0.13.2
   trimmomatic 0.32
   ea-utils 1.1.2
   acacia 1.50
   usearch v7
   emboss 6.6.0
   bio-community 0.001005
   bioperl

Then add the ./scripts/ and ./contrib/ folders to your PATH environment
variable.
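
As a minimal sketch of this setup, assuming a POSIX shell and a copy of Hitman
unpacked in $HOME/hitman, you could extend PATH and check that tools are
reachable as shown below. HITMAN_DIR and missing_tools are hypothetical names
used for illustration; they are not part of Hitman.

```shell
# Assumption: HITMAN_DIR is the directory holding your copy of Hitman.
HITMAN_DIR=${HITMAN_DIR:-$HOME/hitman}
PATH="$HITMAN_DIR/scripts:$HITMAN_DIR/contrib:$PATH"
export PATH

# missing_tools NAME...
# Hypothetical helper, not shipped with Hitman: prints each command that
# cannot be found on PATH and returns non-zero if any is missing.
missing_tools() {
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      status=1
    fi
  done
  return "$status"
}
```

For example, `missing_tools perl bpipe pigz` lists whichever of those commands
cannot be found. The executable names to check are assumptions: they depend on
how each dependency was installed on your system.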


DESCRIPTION

Hitman is a bioinformatic workflow based around the UPARSE methodology and
composed using bpipe. In brief, Hitman:
1/  demultiplexes libraries using EA-utils' fastq-multx tool
2/  joins read pairs with PEAR, but keeps the forward read when pairs cannot be
    joined
3/  clips sequencing adapters using TRIMMOMATIC
4/  truncates the 3' end of sequences at the first residue below a threshold
    quality Q value using TRIMMOMATIC
5/  trims the 3' end of all sequences to a target length L using TRIMMOMATIC, 
    discarding all smaller sequences
6/  removes sequences exceeding the maximum number of expected errors E using
    USEARCH's fastq_filter
7/  uses USEARCH's cluster_otus to form operational taxonomic units (OTUs) from
    high-fidelity sequences (stringent QA in steps 4 and 6) that are sorted by
    decreasing abundance, occur at least twice in the dataset and have at least
    O% similarity
8/  discards chimeric OTUs in a reference-independent fashion using USEARCH's
    cluster_otus, and in a reference-based fashion using UCHIME with database C
9/  assigns regular-fidelity sequences (less stringent QA in steps 4 and 6) to
    each OTU using USEARCH's usearch_global
10/ formats the results in BIOM format using Bio-Community's bc_convert_files
11/ gives a taxonomic assignment to each OTU by globally aligning its
    representative sequence against a database T of reference sequences trimmed
    to the target region (keeping only the best-matching alignment with at least
    I% identity) using USEARCH's usearch_global
12/ removes OTUs belonging to specific taxa W using Bio-Community's
    bc_manage_samples
13/ rarefies the microbiome profiles at the given depth D with Bio-Community's
    bc_accumulate
14/ corrects gene-copy number (GCN) bias using CopyRighter.
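
To make steps 4 to 6 more concrete, here is a rough shell sketch of the same
logic applied to a stream of 4-line FASTQ records. It is not Hitman's code
(Hitman delegates this work to TRIMMOMATIC and USEARCH). qa_sketch is a
hypothetical helper, and Phred+33 quality encoding is assumed. The expected
number of errors of a read is the sum of its per-base error probabilities,
10^(-q/10), where q is the quality score of each base.

```shell
# qa_sketch Q L E < reads.fastq
# Hypothetical helper, not part of Hitman. Assumes 4-line FASTQ records with
# Phred+33 qualities. Prints "id <TAB> trimmed sequence <TAB> expected errors"
# for every read that survives the three filters.
qa_sketch() {
  awk -v Q="$1" -v L="$2" -v E="$3" '
    # Lookup table from printable ASCII character to its code
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 1 { id = $0 }
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 0 {
      qual = $0
      # Step 4: cut at the first base whose quality is below Q
      keep = length(qual)
      for (i = 1; i <= length(qual); i++)
        if (ord[substr(qual, i, 1)] - 33 < Q) { keep = i - 1; break }
      # Step 5: trim to L bases, dropping reads that are too short
      if (keep < L) next
      # Step 6: expected errors = sum over kept bases of 10^(-q/10)
      ee = 0
      for (i = 1; i <= L; i++)
        ee += exp(-(ord[substr(qual, i, 1)] - 33) / 10 * log(10))
      if (ee > E) next
      printf "%s\t%s\t%.4f\n", id, substr(seq, 1, L), ee
    }
  '
}
```

With `qa_sketch 20 250 0.5 < reads.fastq`, a read is kept only if its first 250
bases all have a quality of at least 20 and together carry no more than 0.5
expected errors. These three values are arbitrary examples, not Hitman's
defaults.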

Many of these steps are optional or accept user-specified parameters. The
output is a BIOM file containing OTU-level, rarefied, GCN-corrected microbial
profiles with taxonomic affiliations.
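
Rarefaction (step 13) can be pictured as random subsampling of each sample's
reads, without replacement, down to a common depth D. The sketch below is a
generic illustration, not bc_accumulate's implementation. rarefy_sketch is a
hypothetical helper, and dropping samples with fewer than D reads is an
assumption of this sketch.

```shell
# rarefy_sketch D [SEED] < read_ids.txt
# Hypothetical helper, not part of Hitman or Bio-Community. Reads one item
# per line (e.g. the read identifiers of one sample) and prints D of them
# chosen uniformly at random without replacement. Prints nothing when fewer
# than D items are available.
rarefy_sketch() {
  awk -v D="$1" -v SEED="${2:-1}" '
    BEGIN { srand(SEED) }
    { line[NR] = $0 }
    END {
      if (NR < D) exit
      need = D
      for (i = 1; i <= NR; i++) {
        # Selection sampling: keep item i with probability need / remaining
        if (rand() * (NR - i + 1) < need) { print line[i]; need-- }
      }
    }
  '
}
```

For instance, `rarefy_sketch 1000 < sample1_reads.txt` keeps 1,000 read
identifiers from a hypothetical per-sample list, which corresponds to the
default rarefaction depth.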


RUNNING HITMAN

Three types of input files are necessary:
1/ SFF or FASTQ files containing regular or paired-end amplicon reads (can be
   gzipped)
2/ FASTA file containing the PCR primers used
3/ If the sequencing run was 454-multiplexed, a 2-column tabular mapping file
   indicating the multiplex identifiers (MIDs) used for each sample

The pipeline itself comprises three steps:
1/ hitman_hire: Collect amplicon reads to analyze from SFF/FASTQ files.
2/ hitman_deploy: Quality-filter and trim the sequences.
3/ hitman_execute: Produce a rarefied OTU table with taxonomic assignments.

Run these scripts with the --man flag to learn more about what they do.

Typical 454 GS-FLX Ti processing looks like this:
   hitman_hire    --dir 01-hire/    --input Gasket67.sff --mapping Gasket67.mapping
   ln -s 01-hire/*cat_files.fastq seqs.fastq
   hitman_deploy  --dir 02-deploy/  --input seqs.fastq
   ln -s 02-deploy/*trunc_seqs.fna seqs_qa.fna
   hitman_execute --dir 03-execute/ --input seqs_qa.fna --primers primers.fna

For typical Illumina MiSeq processing (with paired-end reads and no mapping
file), we can skip the Acacia homopolymer denoising step and increase the
rarefaction depth from the default of 1,000 reads to 10,000 reads:
   hitman_hire    --dir 01-hire/    --five Sample1.R1.fastq.gz Sample2.R1.fastq.gz --three Sample1.R2.fastq.gz Sample2.R2.fastq.gz
   ln -s 01-hire/*cat_files.fastq seqs.fastq
   hitman_deploy  --dir 02-deploy/  --input seqs.fastq -p SKIP_DENOISING=1 -p LIB_ADAPTERS=/path/to/trimmomatic/adapters/NexteraPE-PE.fa
   ln -s 02-deploy/*trunc_seqs.fna seqs_qa.fna
   hitman_execute --dir 03-execute/ --input seqs_qa.fna --primers primers.fna -p RAREFACTION_DEPTH=10000

If you are comfortable with the process, though, I advise running it in four
steps:
1/ hitman_hire (as usual)
2/ hitman_deploy, using stringent quality cutoffs
   (these reads will be used to determine OTU representative sequences only)
3/ hitman_deploy, using regular quality cutoffs
   (these reads will be mapped to high-quality OTUs)
4/ hitman_execute, using --input and --input2 to pass high- and regular-quality
   reads, respectively
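
The four steps above could be strung together roughly as follows, reusing the
454 example. This is a sketch under assumptions, not a tested recipe: the
directory and link names are made up, and STRINGENT_OPTS and REGULAR_OPTS are
hypothetical placeholders for the -p NAME=VALUE quality settings documented by
`hitman_deploy --man` (the parameter names are deliberately not guessed here).

```shell
# Hypothetical placeholders: fill in with the -p NAME=VALUE settings listed
# by `hitman_deploy --man`. Left empty, both runs use the default cutoffs.
STRINGENT_OPTS=${STRINGENT_OPTS:-}
REGULAR_OPTS=${REGULAR_OPTS:-}

# Set RUN=echo to print the commands instead of executing them (dry run).
RUN=${RUN:-}

four_step() {
  # 1/ collect the reads
  $RUN hitman_hire --dir 01-hire/ --input Gasket67.sff --mapping Gasket67.mapping
  $RUN ln -s 01-hire/*cat_files.fastq seqs.fastq
  # 2/ stringent QA (options are left unquoted so that they split into words)
  $RUN hitman_deploy --dir 02-deploy-stringent/ --input seqs.fastq $STRINGENT_OPTS
  # 3/ regular QA
  $RUN hitman_deploy --dir 02-deploy-regular/ --input seqs.fastq $REGULAR_OPTS
  $RUN ln -s 02-deploy-stringent/*trunc_seqs.fna seqs_hq.fna
  $RUN ln -s 02-deploy-regular/*trunc_seqs.fna seqs_rq.fna
  # 4/ high-quality reads via --input, regular-quality reads via --input2
  $RUN hitman_execute --dir 03-execute/ --input seqs_hq.fna --input2 seqs_rq.fna --primers primers.fna
}
```

Running `RUN=echo; four_step` prints the seven commands without executing
anything, which is a cheap way to review them before a real run.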

Note: Hitman relies quite heavily on file names and extensions to work. Make
sure your files are named properly.


COPYRIGHT

Copyright 2012-2014 Florent ANGLY <florent.angly@gmail.com>

Hitman is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation, either version 3 of the License, or (at your
option) any later version.

Hitman is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.

You should have received a copy of the GNU General Public License along
with Hitman. If not, see <http://www.gnu.org/licenses/>.