AMPtk

NGS AMPlicon Tool Kit:

AMPtk is a series of scripts to process NGS amplicon data using USEARCH and VSEARCH, it can also be used to process any NGS amplicon data and includes databases setup for analysis of fungal ITS, fungal LSU, bacterial 16S, and insect COI amplicons. It can handle Ion Torrent, MiSeq, and 454 data. At least USEARCH v9.1.13 and VSEARCH v2.2.0 are required as of AMPtk v0.7.0.

Installation:

Mac install instructions
Linux install instructions
Windows install instuructions - Note use on Windows is not recommended.

AMPtk Wrapper script

AMPtk comes with a wrapper script for ease of use. On UNIX, you can call it by simply typing amptk, while on windows you may need to type amptk.py (not needed if you add the .py extension in your PATHEXT, directions here).

$ amptk
Usage:       amptk <command> <arguments>
version:     0.10.0

Description: AMPtk is a package of scripts to process NGS amplicon data.  
             Dependencies:  USEARCH v9.1.13 and VSEARCH v2.2.0
    
Process:     ion         pre-process Ion Torrent data
             illumina    pre-process folder of de-multiplexed Illumina data
             illumina2   pre-process PE Illumina data from a single file
             illumina3   pre-process PE Illumina + index reads (i.e. R1, R2, I)
             454         pre-process Roche 454 (pyrosequencing) data
             SRA         pre-process single FASTQ per sample data (i.e. SRA data)
             
Clustering:  cluster     cluster OTUs (using UPARSE algorithm)
             dada2       dada2 denoising algorithm (requires R, dada2, ShortRead)
             unoise2     UNOISE2 denoising algorithm
             cluster_ref closed/open reference based clustering (EXPERIMENTAL)

Utilities:   filter      OTU table filtering
             taxonomy    Assign taxonomy to OTUs
             show        show number or reads per barcode from de-multiplexed data
             select      select reads (samples) from de-multiplexed data
             remove      remove reads (samples) from de-multiplexed data
             sample      sub-sample (rarify) de-multiplexed reads per sample
             drop        Drop OTUs from dataset
             summarize   Summarize Taxonomy (create OTU-like tables and/or stacked bar graphs)
             funguild    Run FUNGuild (annotate OTUs with ecological information) 
             meta        pivot OTU table and append to meta data
             heatmap     Create heatmap from OTU table
             SRA-submit  De-multiplex data and create meta data for NCBI SRA submission

Setup:       install     Download/install pre-formatted taxonomy DB. Only need to run once.
             database    Format Reference Databases for Taxonomy
             primers     List primers hard-coded in AMPtk. Can use in pre-processing steps.

And then by calling one of the commands, you get a help menu for each:

$ amptk cluster
Usage:       amptk cluster <arguments>
version:     0.10.0

Description: Script is a "wrapper" for the UPARSE algorithm. FASTQ quality trimming via expected 
             errors and Dereplication are run in vsearch if installed otherwise defaults to Python 
             which allows for the use of datasets larger than 4GB.  
             Chimera filtering and UNOISE are also options.
    
Arguments:   -i, --fastq         Input FASTQ file (Required)
             -o, --out           Output base name. Default: out
             -e, --maxee         Expected error quality trimming. Default: 1.0
             -p, --pct_otu       OTU Clustering Radius (percent). Default: 97
             -m, --minsize       Minimum size to keep (singleton filter). Default: 2
             --uchime_ref        Run Ref Chimera filtering. Default: off [ITS, LSU, COI, 16S, custom path]
             --map_filtered      Map quality filtered reads back to OTUs. Default: off
             --unoise            Run De-noising pre-clustering (UNOISE). Default: off
             --debug             Keep intermediate files.
             -u, --usearch       USEARCH executable. Default: usearch9

Installing Databases:

AMPtk is pre-configured to deal with amplicons from fungal ITS, fungal LSU, bacterial 16S, and insect COI. You can see how the databases were constructed here. After installation of AMPtk, you can download and install these databases using the amptk install command (to overwrite existing databases, use the --force option. To install all of the databases, you would type:

#install all databases
amptk install -i ITS LSU 16S COI

#install only ITS databases
amptk install -i ITS

The resulting databases are stored in the DB folder of the amptk directory. These data are used for both Chimera reference filtering and for assigning taxonomy.

What processing script do I use??

You need to be familiar with your read structure! For example, do you have barcodes at the 5' end of your amplicons? Is the primer sequence in your reads or has it been removed by the sequencing software? I prefer that data be minimally processed - this acts largely as a quality control filter. For example when using Ion Torrent data (see instructions below) it is beneficial to turn off all filtering of the data on the server, by doing this you ensure that the reads are properly quality filtered and barcodes/primers are used as an additional quality filter. If you have PE MiSeq data, typically it comes in an already demultiplexed format, i.e. a folder of PE reads for each sample (_R1.fastq.gz, _R2.fastq.gz) - these data can be used directly with the amptk illumina command. A somewhat common (although I think not ideal) approach is to use a custom sequencing primer with Illumina data - and then the output from the sequencer is already demulitplexed and primers are already stripped - in this case, you can use the amptk illumina command, but make sure to pass the --require_primer off option. However, I have seen several other formats of data as well - even with Illumina data being in a single file, where then the amptk illumina2 script is helpful. Bottom line is if you don't know how your data is structured in terms of barcodes, primers, reads, paired-ends, etc - you need to find out before doing anything else....

Processing Ion Torrent Data:

From the Torrent Server, analyze the data using the --disable-all-filters BaseCaller argument. This will leave the adapters/key/barcode sequence intact. You can download either the unaligned BAM file from the server or a FASTQ file. You can then de-multiplex the data as follows:

amptk ion -i data.bam -o data -f fITS7 -r ITS4
amptk ion --barcodes 1,5,24 -i data.fastq.gz -o data -f ITS1-F -r ITS2
amptk ion --barcode_fasta my_barcodes.fa -i data.fastq -o data -f ITS1 -r ITS2
amptk ion -i data.bam -m mapping_file.txt -o data

This will find Ion barcodes (1, 5, and 24) and relabel header with that information (barcodelabel=BC_5;). By default, it will look for all 96 Ion Xpress barcodes, specifiy the barcodes you used by a comma separated list. You can also pass in a fasta file containing your barcode sequences with properly labeled headers. Next the script will find and trim both the forward and reverse primers (default is ITS2 region: fITS7 & ITS4), and then finally will trim or pad with N's to a set length (default: 250 bp). Trimming to the same length is critcally important for USEARCH to cluster correctly, padding with N's after finding the reverse primer keeps short ITS sequences from being discarded. These options can be customized using: --fwd_primer, --rev_primer, --trim_len, etc.

Processing Roche 454 Data:

Data from 454 instruments has the same read structure as Ion Torrent: Read and thus can be processed very similarly. You just need to provide either an SFF file, FASTA + QUAL, or FASTQ files and then you need to specify a multi-fasta file containing the barcodes used in the project. The data will be processed in the same fashion (see above) as the Ion Torrent Data. For example:

#SFF input
amptk 454 -i data.sff --barcode_fasta my454barcodes.fa -o 454project

#FASTA/QUAL input
amptk 454 -i data.fa -q data.qual --barcode_fasta my454barcodes.fa -o 454project

Processing Illumina MiSeq PE Data:

Paired-end MiSeq data is typically delivered already de-multiplexed into separate read files, that have a defined naming structure from Illumina that looks like this:

<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits>.fastq.gz

You can processes a folder of Illumina data like this:

amptk illumina -i folder_name -o miseqData -f fITS7 -r ITS4
amptk illumina -i folder_name -o miseqData -m mapping_file.txt

This will find all files ending with '.fastq.gz' in the input folder, gunzip the files, and then sequentially process the paired read files. The script will run -fastq_mergepairs from USEARCH9 or VSEARCH (dependent on --merge_method) , however, since some ITS sequences are too long to overlap, the default setting is to rescue longer sequences by recovering the the non-merged forward reads. (to turn off use --rescue_forward off ). Alternatively, you can only utilize the forward reads (R1), by passing the --reads forward argument. Next the forward and reverse primers are removed and the reads are trimmed/padded to a set length of clustering (controlled by the --pad option). Finally, the resulting FASTQ files for each of the processed samples are concatenated together into a file called miseqData.demux.fq.gz that will be used for the next clustering step. The script will also output a text file called miseqData-filenames.txt that contains a tab-delimited output of the sample name as well as [i5] and [i7] index sequences that were used. The script will produce a folder containing the individual de-multiplexed files named from the -o, --out argment.

OTU Picking:

After samples are pre-processed, the next step is to pick OTUs from the data. AMPtk can do this by using UPARSE, UNOISE2, DADA2, or reference based clustering. UPARSE is run using the amptk cluster command, which expects de-multiplexed FASTQ (or GZIPPED FASTQ) data as a single file with ;barcodelabel=Sample_name in the FASTQ header. If your data is in some other format, you can use other UNIX/Perl/Python scripts to add the barcodelabel= to each read and then cluster your data using AMPtk. Note that reads should have primers trimmed from them and can optionally be globally trimmed by padding with N's to a set length. For variable length amplicons, you should NOT discard short sequences that contain both a forward and reverse primer. AMPtk take steps to ensure high quality data makes it into the clustering algorithm with minimal sequence loss via the pre-processing steps. Now the data from either platform (Ion, 454, or Illumina) can be clustered by running the following:

amptk cluster -i amptk.demux.fq -o ion_output

This script wil quality filter the data based on expected errors, then remove duplicated sequences (dereplication), sort the output by abundance, and finally cluster using usearch -cluster_otus command. You can also optionally run UCHIME Reference filtering by adding the --uchime_ref ITS option. Type -h for all the available options.

DADA2 and UNOISE2 "denoising":

Recently there is a new "OTU picking" algorithm for amplicon based datasets called DADA2 that has sensitivity down to single base pairs, see publication here, GitHub here. This algorithm uses a statistical method to infer the original sequence that a read was derived from, foregoing the need to cluster at a set threshold (i.e. 97%). Instead of clustering into OTUs, the algorithm finds "exact sequence variants" or "inferred sequences" (iSeqs). I've implemented a modified DADA2 pipeline here to work with the current AMPtk data structure. A reminder is that reads for DADA2 must have no N's. DADA2 v1.4 and greater support variable length amplicons. You can run it as follows:

amptk dada2 -i amptk.demux.fq.gz -o dada2_output

The script will quality filter your data, trim for use in DADA2, run DADA2 alogrithm, and then parse the results to output an OTU table and a file containing inferred sequences (OTUs) in fasta format. These files can be used in all downstream AMPtk scripts, i.e. amptk filter and amptk taxonomy. A similar algorithm has been employed in USEARCH called UNOISE2, you can run that as follows:

amptk unoise2 -i amptk.demux.fq.gz -o unoise2_output

OTU Table Filtering

The data may need some additional filtering if you included a spike-in control mock community. The advantage is that you know what should be in the spike-in control barcode sample, thus you can modify USEARCH9 clustering parameters that give you reasonable results. If you need to trim your OTU table by some threshold, i.e. several OTUs at low abundance are showing up in your spike-in control sample that represent contamination or sequence error - you can set a threshold and filter the OTU table. This is done with the following script:

amptk filter -i test.otu_table.txt -f test.final.otus.fa -b mock3 --mc my_mock_seqs.fa

This will read the OTU table -i and the OTUs -f from the amptk cluster command. This script will apply an index-bleed filter to clean-up barcode-switching between samples which happens at a rate of ~ 0.2% in Ion Torrent and as much 0.3% in MiSeq data. The script first normalizes the OTU table to the number of reads in each sample, then (optionally) using the -b sample, it will calculate the amount of index-bleed in the OTU table, finally it will loop through each OTU and change values to 0 that are below the -index_bleed filter. Finally, this script will remove the mock spike in control sample from your dataset - as it should not be included in downstream processing, you can keep mock sequences if desired by passing the --keep_mock argument. The output is a filtered OTU table to be used for downstream processing.

If you do not have a mock community spike in, you can still run the index bleed filter (and you probably should as nearly all NGS data has some degree of barcode switching or index-bleed) by just running the command without a -b argument, such as, which will apply a 0.5% filter on the data. Passing the -p 0.005 argument will over-ride the calculated index-bleed.

amptk filter -i test.otu_table.txt -f test.final.otus.fa -p 0.005

Assign Taxonomy:

You can assign taxonomy to your OTUs using AMPtk, either using UTAX from USEARCH9 or using usearch_global. The databases require some initial setup before you can use the amptk taxonomy command.

Issuing the amptk taxonomy command will inform you which databases have been properly configured as well as usage instructions:

$ amptk taxonomy
Usage:       amptk taxonomy <arguments>
version:     0.10.0

Description: Script maps OTUs to taxonomy information and can append to an OTU table (optional).  
             By default the script uses a hybrid approach, e.g. gets taxonomy information from 
             SINTAX, UTAX, and global alignment hits from the larger UNITE-INSD database, and 
             then parses results to extract the most taxonomy information that it can at 'trustable' 
             levels. SINTAX/UTAX results are used if BLAST-like search pct identity is less than 97%.  
             If % identity is greater than 97%, the result with most taxonomy levels is retained.
    
Arguments:   -f, --fasta         Input FASTA file (i.e. OTUs from amptk cluster) (Required)
             -i, --otu_table     Input OTU table file (i.e. otu_table from amptk cluster)
             -o, --out           Base name for output file. Default: amptk-taxonomy.<method>.txt
             -d, --db            Select Pre-installed database [ITS1, ITS2, ITS, 16S, LSU, COI]. Default: ITS2
             -m, --mapping_file  QIIME-like mapping file
             -t, --taxonomy      Taxonomy calculated elsewhere. 2 Column file.
             --method            Taxonomy method. Default: hybrid [utax, sintax, usearch, hybrid, rdp, blast]
             --fasta_db          Alternative database of fasta sequenes to use for global alignment.
             --utax_db           UTAX formatted database. Default: ITS2.udb [See configured DB's below]
             --utax_cutoff       UTAX confidence value threshold. Default: 0.8 [0 to 0.9]
             --usearch_db        USEARCH formatted database. Default: USEARCH.udb
             --usearch_cutoff    USEARCH threshold percent identity. Default 0.7
             --sintax_cutoff     SINTAX confidence value threshold. Default: 0.8 [0 to 0.9]
             -r, --rdp           Path to RDP Classifier. Required if --method rdp
             --rdp_db            RDP Classifer DB set. [fungalits_unite, fungalits_warcup. fungallsu, 16srrna]  
             --rdp_cutoff        RDP Classifer confidence value threshold. Default: 0.8 [0 to 1.0]
             --local_blast       Local Blast database (full path) Default: NCBI remote nt database   
             --tax_filter        Remove OTUs from OTU table that do not match filter, i.e. Fungi to keep only fungi.
             -u, --usearch       USEARCH executable. Default: usearch9
             --debug             Keep intermediate files

And then you can use the amptk taxonomy command to assign taxonomy to your OTUs as well as append them to your OTU table as follows:

#use of hybrid taxonomy approach
amptk taxonomy -f data.filtered.otus.fa -o output -i data.final.csv -m mapping_file.txt -d ITS2

#filter data to only include OTUs identified to Fungi
amptk taxonomy -f data.filtered.otus.fa -o output -i data.final.csv -m mapping_file.txt -d ITS2 --tax_filter Fungi

#use RDP classifier
amptk taxonomy -f data.filtered.otus.fa -o output -i data.final.csv -m mapping_file.txt --method rdp --rdp_db fungalits_unite -rdp /path/to/classifier.jar

Summarizing the Taxonomy:

After taxonomy is appended to your OTU table, you can then generate OTU-like tables for each of your samples at all of the levels of taxonomy (i.e. Kingdom, Phylum, Class, Order, Family, Genus). At the same time, you can create a QIIME-like stacked bar graph from these data.

amptk summarize -i data.taxonomy.otu_table.txt -o data-summary

The optional --graphs argument will create the stacked bar graphs. You can save in a variety of formats as well as convert the result to precent of total with the --percent argument.

Downstream Processing:

As of amptk v0.6.0, the output from the amptk taxonomy command will create a biom file that contains taxonomy compatible with QIIME, PHINCH, PhyloSeq, and MetaCoMET. The script will also output a phylogenetic tree from your OTUs which is required for some downstream analysis. Moreover, you can pass the -m, --mapping_file option to amptk taxonomy and all columns will be incorporated as sample metadata. A mapping file is created automatically for you in the pre-processing steps of amptk, such as amptk ion and amptk illumina. You can easily add your metadata to this file using something like excel, and then save as a tab delimited text file.

QIIME and PhyloSeq import instructions

Dependencies

Python 2.7
Biopython
USEARCH9
natsort
pandas
numpy
matplotlib
psutil
bedtools (only needed if using Ion Torrent BAM file as input)
vsearch (version > 2.2.0)
biom-format (to create biom OTU table)
h5py (for biom)
R (dada2)
dada2, ShortRead (these will be automatically installed on first usage of amptk dada2

Python and USEARCH need to accessible in PATH; alternatively you can pass in the variable -u /path/to/usearch9 to scripts requiring USEARCH9.

In order to draw a heatmap or stacked bar graph using amptk.py heatmap or amptk summarize you will need to have the following python libraries installed: matplotlib, pandas, numpy. They can be installed with pip, i.e. pip install matplotlib pandas numpy or conda, conda install matplotlib pandas numpy.

vmikk/ufits