AMPtk
NGS AMPlicon Tool Kit:
AMPtk is a series of scripts to process NGS amplicon data using USEARCH and VSEARCH, it can also be used to process any NGS amplicon data and includes databases setup for analysis of fungal ITS, fungal LSU, bacterial 16S, and insect COI amplicons. It can handle Ion Torrent, MiSeq, and 454 data. At least USEARCH v9.1.13 and VSEARCH v2.2.0 are required as of AMPtk v0.7.0.
Installation:
- Mac install instructions
- Linux install instructions
- Windows install instuructions - Note use on Windows is not recommended.
AMPtk Wrapper script
AMPtk comes with a wrapper script for ease of use. On UNIX, you can call it by simply typing amptk
, while on windows you may need to type amptk.py
(not needed if you add the .py
extension in your PATHEXT, directions here).
$ amptk
Usage: amptk <command> <arguments>
version: 0.10.0
Description: AMPtk is a package of scripts to process NGS amplicon data.
Dependencies: USEARCH v9.1.13 and VSEARCH v2.2.0
Process: ion pre-process Ion Torrent data
illumina pre-process folder of de-multiplexed Illumina data
illumina2 pre-process PE Illumina data from a single file
illumina3 pre-process PE Illumina + index reads (i.e. R1, R2, I)
454 pre-process Roche 454 (pyrosequencing) data
SRA pre-process single FASTQ per sample data (i.e. SRA data)
Clustering: cluster cluster OTUs (using UPARSE algorithm)
dada2 dada2 denoising algorithm (requires R, dada2, ShortRead)
unoise2 UNOISE2 denoising algorithm
cluster_ref closed/open reference based clustering (EXPERIMENTAL)
Utilities: filter OTU table filtering
taxonomy Assign taxonomy to OTUs
show show number or reads per barcode from de-multiplexed data
select select reads (samples) from de-multiplexed data
remove remove reads (samples) from de-multiplexed data
sample sub-sample (rarify) de-multiplexed reads per sample
drop Drop OTUs from dataset
summarize Summarize Taxonomy (create OTU-like tables and/or stacked bar graphs)
funguild Run FUNGuild (annotate OTUs with ecological information)
meta pivot OTU table and append to meta data
heatmap Create heatmap from OTU table
SRA-submit De-multiplex data and create meta data for NCBI SRA submission
Setup: install Download/install pre-formatted taxonomy DB. Only need to run once.
database Format Reference Databases for Taxonomy
primers List primers hard-coded in AMPtk. Can use in pre-processing steps.
And then by calling one of the commands, you get a help menu for each:
$ amptk cluster
Usage: amptk cluster <arguments>
version: 0.10.0
Description: Script is a "wrapper" for the UPARSE algorithm. FASTQ quality trimming via expected
errors and Dereplication are run in vsearch if installed otherwise defaults to Python
which allows for the use of datasets larger than 4GB.
Chimera filtering and UNOISE are also options.
Arguments: -i, --fastq Input FASTQ file (Required)
-o, --out Output base name. Default: out
-e, --maxee Expected error quality trimming. Default: 1.0
-p, --pct_otu OTU Clustering Radius (percent). Default: 97
-m, --minsize Minimum size to keep (singleton filter). Default: 2
--uchime_ref Run Ref Chimera filtering. Default: off [ITS, LSU, COI, 16S, custom path]
--map_filtered Map quality filtered reads back to OTUs. Default: off
--unoise Run De-noising pre-clustering (UNOISE). Default: off
--debug Keep intermediate files.
-u, --usearch USEARCH executable. Default: usearch9
Installing Databases:
AMPtk is pre-configured to deal with amplicons from fungal ITS, fungal LSU, bacterial 16S, and insect COI. You can see how the databases were constructed here. After installation of AMPtk, you can download and install these databases using the amptk install
command (to overwrite existing databases, use the --force
option. To install all of the databases, you would type:
#install all databases
amptk install -i ITS LSU 16S COI
#install only ITS databases
amptk install -i ITS
The resulting databases are stored in the DB
folder of the amptk
directory. These data are used for both Chimera reference filtering and for assigning taxonomy.
What processing script do I use??
You need to be familiar with your read structure! For example, do you have barcodes at the 5' end of your amplicons? Is the primer sequence in your reads or has it been removed by the sequencing software? I prefer that data be minimally processed - this acts largely as a quality control filter. For example when using Ion Torrent data (see instructions below) it is beneficial to turn off all filtering of the data on the server, by doing this you ensure that the reads are properly quality filtered and barcodes/primers are used as an additional quality filter. If you have PE MiSeq data, typically it comes in an already demultiplexed format, i.e. a folder of PE reads for each sample (_R1.fastq.gz, _R2.fastq.gz) - these data can be used directly with the amptk illumina
command. A somewhat common (although I think not ideal) approach is to use a custom sequencing primer with Illumina data - and then the output from the sequencer is already demulitplexed and primers are already stripped - in this case, you can use the amptk illumina
command, but make sure to pass the --require_primer off
option. However, I have seen several other formats of data as well - even with Illumina data being in a single file, where then the amptk illumina2
script is helpful. Bottom line is if you don't know how your data is structured in terms of barcodes, primers, reads, paired-ends, etc - you need to find out before doing anything else....
Processing Ion Torrent Data:
From the Torrent Server, analyze the data using the --disable-all-filters
BaseCaller argument. This will leave the adapters/key/barcode sequence intact. You can download either the unaligned BAM file from the server or a FASTQ file. You can then de-multiplex the data as follows:
amptk ion -i data.bam -o data -f fITS7 -r ITS4
amptk ion --barcodes 1,5,24 -i data.fastq.gz -o data -f ITS1-F -r ITS2
amptk ion --barcode_fasta my_barcodes.fa -i data.fastq -o data -f ITS1 -r ITS2
amptk ion -i data.bam -m mapping_file.txt -o data
This will find Ion barcodes (1, 5, and 24) and relabel header with that information (barcodelabel=BC_5;). By default, it will look for all 96 Ion Xpress barcodes, specifiy the barcodes you used by a comma separated list. You can also pass in a fasta file containing your barcode sequences with properly labeled headers. Next the script will find and trim both the forward and reverse primers (default is ITS2 region: fITS7 & ITS4), and then finally will trim or pad with N's to a set length (default: 250 bp). Trimming to the same length is critcally important for USEARCH to cluster correctly, padding with N's after finding the reverse primer keeps short ITS sequences from being discarded. These options can be customized using: --fwd_primer
, --rev_primer
, --trim_len
, etc.
Processing Roche 454 Data:
Data from 454 instruments has the same read structure as Ion Torrent: Read and thus can be processed very similarly. You just need to provide either an SFF file, FASTA + QUAL, or FASTQ files and then you need to specify a multi-fasta file containing the barcodes used in the project. The data will be processed in the same fashion (see above) as the Ion Torrent Data. For example:
#SFF input
amptk 454 -i data.sff --barcode_fasta my454barcodes.fa -o 454project
#FASTA/QUAL input
amptk 454 -i data.fa -q data.qual --barcode_fasta my454barcodes.fa -o 454project
Processing Illumina MiSeq PE Data:
Paired-end MiSeq data is typically delivered already de-multiplexed into separate read files, that have a defined naming structure from Illumina that looks like this:
<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits>.fastq.gz
You can processes a folder of Illumina data like this:
amptk illumina -i folder_name -o miseqData -f fITS7 -r ITS4
amptk illumina -i folder_name -o miseqData -m mapping_file.txt
This will find all files ending with '.fastq.gz' in the input folder, gunzip the files, and then sequentially process the paired read files. The script will run -fastq_mergepairs
from USEARCH9 or VSEARCH (dependent on --merge_method
) , however, since some ITS sequences are too long to overlap, the default setting is to rescue longer sequences by recovering the the non-merged forward reads. (to turn off use --rescue_forward off
). Alternatively, you can only utilize the forward reads (R1), by passing the --reads forward
argument. Next the forward and reverse primers are removed and the reads are trimmed/padded to a set length of clustering (controlled by the --pad
option). Finally, the resulting FASTQ files for each of the processed samples are concatenated together into a file called miseqData.demux.fq.gz
that will be used for the next clustering step. The script will also output a text file called miseqData-filenames.txt
that contains a tab-delimited output of the sample name as well as [i5] and [i7] index sequences that were used. The script will produce a folder containing the individual de-multiplexed files named from the -o, --out
argment.
OTU Picking:
After samples are pre-processed, the next step is to pick OTUs from the data. AMPtk can do this by using UPARSE, UNOISE2, DADA2, or reference based clustering. UPARSE is run using the amptk cluster
command, which expects de-multiplexed FASTQ (or GZIPPED FASTQ) data as a single file with ;barcodelabel=Sample_name
in the FASTQ header. If your data is in some other format, you can use other UNIX/Perl/Python scripts to add the barcodelabel=
to each read and then cluster your data using AMPtk. Note that reads should have primers trimmed from them and can optionally be globally trimmed by padding with N's to a set length. For variable length amplicons, you should NOT discard short sequences that contain both a forward and reverse primer. AMPtk take steps to ensure high quality data makes it into the clustering algorithm with minimal sequence loss via the pre-processing steps. Now the data from either platform (Ion, 454, or Illumina) can be clustered by running the following:
amptk cluster -i amptk.demux.fq -o ion_output
This script wil quality filter the data based on expected errors, then remove duplicated sequences (dereplication), sort the output by abundance, and finally cluster using usearch -cluster_otus
command. You can also optionally run UCHIME Reference filtering by adding the --uchime_ref ITS
option. Type -h
for all the available options.
DADA2 and UNOISE2 "denoising":
Recently there is a new "OTU picking" algorithm for amplicon based datasets called DADA2 that has sensitivity down to single base pairs, see publication here, GitHub here. This algorithm uses a statistical method to infer the original sequence that a read was derived from, foregoing the need to cluster at a set threshold (i.e. 97%). Instead of clustering into OTUs, the algorithm finds "exact sequence variants" or "inferred sequences" (iSeqs). I've implemented a modified DADA2 pipeline here to work with the current AMPtk data structure. A reminder is that reads for DADA2 must have no N's. DADA2 v1.4 and greater support variable length amplicons. You can run it as follows:
amptk dada2 -i amptk.demux.fq.gz -o dada2_output
The script will quality filter your data, trim for use in DADA2, run DADA2 alogrithm, and then parse the results to output an OTU table and a file containing inferred sequences (OTUs) in fasta format. These files can be used in all downstream AMPtk scripts, i.e. amptk filter
and amptk taxonomy
. A similar algorithm has been employed in USEARCH called UNOISE2, you can run that as follows:
amptk unoise2 -i amptk.demux.fq.gz -o unoise2_output
OTU Table Filtering
The data may need some additional filtering if you included a spike-in control mock community. The advantage is that you know what should be in the spike-in control barcode sample, thus you can modify USEARCH9 clustering parameters that give you reasonable results. If you need to trim your OTU table by some threshold, i.e. several OTUs at low abundance are showing up in your spike-in control sample that represent contamination or sequence error - you can set a threshold and filter the OTU table. This is done with the following script:
amptk filter -i test.otu_table.txt -f test.final.otus.fa -b mock3 --mc my_mock_seqs.fa
This will read the OTU table -i
and the OTUs -f
from the amptk cluster
command. This script will apply an index-bleed filter to clean-up barcode-switching between samples which happens at a rate of ~ 0.2% in Ion Torrent and as much 0.3% in MiSeq data. The script first normalizes the OTU table to the number of reads in each sample, then (optionally) using the -b
sample, it will calculate the amount of index-bleed in the OTU table, finally it will loop through each OTU and change values to 0 that are below the -index_bleed
filter. Finally, this script will remove the mock spike in control sample from your dataset - as it should not be included in downstream processing, you can keep mock sequences if desired by passing the --keep_mock
argument. The output is a filtered OTU table to be used for downstream processing.
If you do not have a mock community spike in, you can still run the index bleed filter (and you probably should as nearly all NGS data has some degree of barcode switching or index-bleed) by just running the command without a -b
argument, such as, which will apply a 0.5% filter on the data. Passing the -p 0.005
argument will over-ride the calculated index-bleed.
amptk filter -i test.otu_table.txt -f test.final.otus.fa -p 0.005
Assign Taxonomy:
You can assign taxonomy to your OTUs using AMPtk, either using UTAX from USEARCH9 or using usearch_global. The databases require some initial setup before you can use the amptk taxonomy
command.
Issuing the amptk taxonomy
command will inform you which databases have been properly configured as well as usage instructions:
$ amptk taxonomy
Usage: amptk taxonomy <arguments>
version: 0.10.0
Description: Script maps OTUs to taxonomy information and can append to an OTU table (optional).
By default the script uses a hybrid approach, e.g. gets taxonomy information from
SINTAX, UTAX, and global alignment hits from the larger UNITE-INSD database, and
then parses results to extract the most taxonomy information that it can at 'trustable'
levels. SINTAX/UTAX results are used if BLAST-like search pct identity is less than 97%.
If % identity is greater than 97%, the result with most taxonomy levels is retained.
Arguments: -f, --fasta Input FASTA file (i.e. OTUs from amptk cluster) (Required)
-i, --otu_table Input OTU table file (i.e. otu_table from amptk cluster)
-o, --out Base name for output file. Default: amptk-taxonomy.<method>.txt
-d, --db Select Pre-installed database [ITS1, ITS2, ITS, 16S, LSU, COI]. Default: ITS2
-m, --mapping_file QIIME-like mapping file
-t, --taxonomy Taxonomy calculated elsewhere. 2 Column file.
--method Taxonomy method. Default: hybrid [utax, sintax, usearch, hybrid, rdp, blast]
--fasta_db Alternative database of fasta sequenes to use for global alignment.
--utax_db UTAX formatted database. Default: ITS2.udb [See configured DB's below]
--utax_cutoff UTAX confidence value threshold. Default: 0.8 [0 to 0.9]
--usearch_db USEARCH formatted database. Default: USEARCH.udb
--usearch_cutoff USEARCH threshold percent identity. Default 0.7
--sintax_cutoff SINTAX confidence value threshold. Default: 0.8 [0 to 0.9]
-r, --rdp Path to RDP Classifier. Required if --method rdp
--rdp_db RDP Classifer DB set. [fungalits_unite, fungalits_warcup. fungallsu, 16srrna]
--rdp_cutoff RDP Classifer confidence value threshold. Default: 0.8 [0 to 1.0]
--local_blast Local Blast database (full path) Default: NCBI remote nt database
--tax_filter Remove OTUs from OTU table that do not match filter, i.e. Fungi to keep only fungi.
-u, --usearch USEARCH executable. Default: usearch9
--debug Keep intermediate files
And then you can use the amptk taxonomy
command to assign taxonomy to your OTUs as well as append them to your OTU table as follows:
#use of hybrid taxonomy approach
amptk taxonomy -f data.filtered.otus.fa -o output -i data.final.csv -m mapping_file.txt -d ITS2
#filter data to only include OTUs identified to Fungi
amptk taxonomy -f data.filtered.otus.fa -o output -i data.final.csv -m mapping_file.txt -d ITS2 --tax_filter Fungi
#use RDP classifier
amptk taxonomy -f data.filtered.otus.fa -o output -i data.final.csv -m mapping_file.txt --method rdp --rdp_db fungalits_unite -rdp /path/to/classifier.jar
Summarizing the Taxonomy:
After taxonomy is appended to your OTU table, you can then generate OTU-like tables for each of your samples at all of the levels of taxonomy (i.e. Kingdom, Phylum, Class, Order, Family, Genus). At the same time, you can create a QIIME-like stacked bar graph from these data.
amptk summarize -i data.taxonomy.otu_table.txt -o data-summary
The optional --graphs
argument will create the stacked bar graphs. You can save in a variety of formats as well as convert the result to precent of total with the --percent
argument.
Downstream Processing:
As of amptk v0.6.0
, the output from the amptk taxonomy
command will create a biom file that contains taxonomy compatible with QIIME, PHINCH, PhyloSeq, and MetaCoMET. The script will also output a phylogenetic tree from your OTUs which is required for some downstream analysis. Moreover, you can pass the -m, --mapping_file
option to amptk taxonomy
and all columns will be incorporated as sample metadata. A mapping file is created automatically for you in the pre-processing steps of amptk, such as amptk ion
and amptk illumina
. You can easily add your metadata to this file using something like excel, and then save as a tab delimited text file.
QIIME and PhyloSeq import instructions
Dependencies
- Python 2.7
- Biopython
- USEARCH9
- natsort
- pandas
- numpy
- matplotlib
- psutil
- bedtools (only needed if using Ion Torrent BAM file as input)
- vsearch (version > 2.2.0)
- biom-format (to create biom OTU table)
- h5py (for biom)
- R (dada2)
- dada2, ShortRead (these will be automatically installed on first usage of
amptk dada2
Python and USEARCH need to accessible in PATH; alternatively you can pass in the variable -u /path/to/usearch9
to scripts requiring USEARCH9.
In order to draw a heatmap or stacked bar graph using amptk.py heatmap
or amptk summarize
you will need to have the following python libraries installed: matplotlib, pandas, numpy
. They can be installed with pip, i.e. pip install matplotlib pandas numpy
or conda, conda install matplotlib pandas numpy
.