An automated pipeline for analyses of fungal internal transcribed spacer (ITS) sequences from the Illumina sequencing platform (Gweon et al., 2015)
Shown to perform better than QIIME2 - See this paper
Some significant changes!
- PIPITS now classifies sequences against UNITE 9.0 (205,888 fungi & 326,300 Eukaryotes - see below).
- The database now includes non-fungi (i.e. Eukaryotes) to ensure that the infamous OTUs with a mere "k__Fungi" could be better classified. With the inclusion, these OTUs can now be "k__Fungi", "k__Viridiplantae" or "k__unidentified". Do note that depending on your choice of primers, you may pick up quite a lot of plant ITS sequences (no primers are perfectly specific for just fungi).
- However, because of the significant increase in the size of the database, PIPITS now requires at least 16GB of RAM (preferably more e.g. 32GB). This may not suit those who used to enjoy running PIPITS on their laptops. Sorry... time has moved on!
- Also the increase in the size of the database meant that RDP Classifier can take a very long time to process the data. For this reason, you now have an option to run SINTAX (VSEARCH) to assign sequences. This is remarkably quick!
- If you find that RDP Classifier is taking too long, please use "--taxassignmentmethod sin" to just run SINTAX (VSEARCH). That said, a SINTAX confidence threshold of 0.85 doesn't equate to 0.85 in RDP Classifier, though from my experience the differences are small. Do note that SINTAX is a non-Bayesian taxonomic classifier.
- I will look to incorporate other classifiers such as CONSTAX in the future!
- UNITE 8.3 added. PIPITS now classifies sequences against UNITE 8.3 (98,090 sequences)
- WARCUP phylotype table bug fixed. It now produces a correctly aggregated table (it used to aggregate at the Family-level, but now it aggregates at the Species-level)
- BIOM to phylotype table bug fixed. After BIOM (one of the dependencies) was upgraded, phylotype table inadvertently got filled with normalised values. This now has been remedied, and it's now back to the previous behaviour. For those who just want to convert OTU tables to phylotype tables without re-running PIPITS again, please update PIPITS, and (within pipits_env) then:
pipits_phylotype_biom -i otu_table.biom -o phylotype_table.txt -l 6
- New UNITE DB (released on 2020-02-04). PIPITS will now download the new UNITE db. Also, a few minor bugs have now been fixed.
- BIOM files are now in the HDF5 format. OTU tables in BIOM format are now written as HDF5 rather than JSON. OTU tables in HDF5 BIOM are supported by PHYLOSEQ and QIIME2.
- PIPITS_PROCESS automatically downloads the UNITE database (the most recent version), so there is no need to meddle with environment variables anymore. Just run the commands and it will take care of the database issues. You can still use an older database with the --unite option (see help with -h).
- PIPITS_FUNITS exploits multiple CPUs. It's an experimental feature, so do use it with care. You can use multiple CPUs with the usual option:
-t NUMBER_OF_CPUS
- Update PIPITS with
conda update --channel bioconda --channel conda-forge --channel defaults pipits
then check you have version 2.3 installed by:
conda list pipits
- PIPITS is an automated pipeline for analyses of fungal internal transcribed spacer (ITS) sequences from the Illumina sequencing platform.
- PIPITS only works on POSIX systems (this essentially means it doesn't work on Windows - sorry...).
- PIPITS will need at least 16 GB of RAM on your machine running 64-bit Linux or macOS.
- Automatically downloads the most recent version of the UNITE fungal db (and also comes with an option to run it against the WARCUP fungal db).
- Just 4 commands, and you are good to go!
It is recommended that you use a conda environment for running PIPITS to ensure that its dependencies are contained in this "sandbox". This means that you don't mess with your existing system and you don't need to be the admin. Don't worry, it's easy - just type the following command.
EXPLANATION: install PIPITS and dependencies and create a Conda environment (here the environment is named "pipits_env" but you can choose any name you wish). PIPITS is exclusively compatible with Python3, so add "python=3.6" as below:
conda create -n pipits_env --channel bioconda --channel conda-forge --channel defaults python=3.6 pipits
PIPITS is divided into three consecutive parts:
- Prepping raw sequences: join, convert, quality filter etc.
- Fungal ITS extraction: remove conserved regions
- Process the reads to produce an OTU abundance table and the taxonomic assignment table for downstream analysis
Let's test it with a very small test dataset to ensure everything is set up correctly.
EXPLANATION: Download & extract a test dataset
wget https://sourceforge.net/projects/pipits/files/PIPITS_TESTDATA/pipits_test.tar.gz -O pipits_test.tar.gz
tar xvfz pipits_test.tar.gz
EXPLANATION: Get into the Conda environment you've just created, and run PIPITS.
cd pipits_test
conda activate pipits_env
pispino_createreadpairslist -i rawdata -o readpairslist.txt
pispino_seqprep -i rawdata -o out_seqprep -l readpairslist.txt
pipits_funits -i out_seqprep/prepped.fasta -o out_funits -x ITS2 -v -r
pipits_process -i out_funits/ITS.fasta -o out_process -v -r
Some rare setups (e.g., installation in user-level folders of dated server distributions) cause pipits_process to fail while converting to biom format. The issue can be solved by updating the fresh installation from within the environment:
conda update pipits
Illumina reads are generally provided as demultiplexed FASTQ files where the Illumina software (BASESPACE) splits the reads into separate files, one for each barcode.
EXPLANATION: PISPINO (originally part of PIPITS) provides a script called pispino_createreadpairslist, which generates a tab-delimited text file listing all read-pairs in the directory containing your raw sequences
pispino_createreadpairslist -i rawdata -o readpairslist.txt
- The command produces a tab-delimited file with three columns denoting forward and reverse read filenames and sample IDs for the pairs
- Prior to running the command, you need to ensure that the raw data are either uncompressed (".fastq"), or compressed with bz2 or gz (".fastq.bz2", ".fastq.gz"). Sample IDs are taken from the first characters preceding an underscore ("_") in each filename
- After running pispino_createreadpairslist, check the resulting file ("readpairslist.txt") to see that the correct filenames and desired sample IDs are listed. No duplicate sample IDs are allowed.
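The naming rule above can be checked before the list file is generated. The following is a minimal sketch and not part of PISPINO: the function names and the example filenames are hypothetical, and it assumes that each sample should contribute exactly two files (one forward, one reverse).

```python
from collections import defaultdict
from pathlib import Path

# Extensions the text above lists as accepted for raw reads.
SUFFIXES = (".fastq", ".fastq.gz", ".fastq.bz2")


def sample_id(filename):
    """Return the characters preceding the first underscore in the filename."""
    return Path(filename).name.split("_", 1)[0]


def group_by_sample(filenames):
    """Group read files by the sample ID derived from their names."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        if name.endswith(SUFFIXES):  # skip anything that is not a read file
            groups[sample_id(name)].append(name)
    return dict(groups)


def problem_samples(groups):
    """Return sample IDs that do not map to exactly one read pair (assumption)."""
    return sorted(sid for sid, files in groups.items() if len(files) != 2)


if __name__ == "__main__":
    # Hypothetical filenames; in practice use os.listdir("rawdata").
    names = [
        "S1_L001_R1_001.fastq.gz",
        "S1_L001_R2_001.fastq.gz",
        "S2_L001_R1_001.fastq",
        "S2_L001_R2_001.fastq",
        "S2_rerun_R1.fastq.gz",
        "notes.txt",
    ]
    print(problem_samples(group_by_sample(names)))
```

Any sample ID reported here is worth resolving by renaming files before running pispino_createreadpairslist, since a prefix shared by unrelated files would otherwise produce a duplicate sample ID.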
EXPLANATION: Once we have the list file ("readpairslist.txt"), we can then begin to "prepare" the sequences:
pispino_seqprep -i rawdata -o out_seqprep -l readpairslist.txt
- Read-pairs are joined by examining the overlapping regions of sequences
- The resulting assembled reads are then quality filtered
- The header of each read is then relabelled with an index number followed by a Sample ID
- The resulting files are converted into FASTA and merged into a single file to produce the final output file "prepped.fasta" in the output directory
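The relabel, convert and merge steps listed above can be illustrated with a small sketch. It is not the PISPINO implementation: the header format ">INDEX_SAMPLEID" is an assumption made for illustration (the text only says an index number is followed by a Sample ID), the function names are hypothetical, and joining and quality filtering are left out.

```python
def fastq_to_fasta(fastq_lines, sample_id, start_index=0):
    """Convert 4-line FASTQ records to FASTA lines with relabelled headers."""
    lines = [line.rstrip("\n") for line in fastq_lines if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per record")
    fasta = []
    for offset in range(0, len(lines), 4):
        header, seq, plus, _qual = lines[offset:offset + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (offset + 1))
        index = start_index + offset // 4
        # Assumed label format: running index, underscore, sample ID.
        fasta.append(">%d_%s" % (index, sample_id))
        fasta.append(seq)
    return fasta


def merge_samples(samples):
    """Merge several samples (sample ID -> FASTQ lines) into one FASTA list."""
    merged = []
    for sid, fastq_lines in samples.items():
        # Two FASTA lines per record, so len(merged) // 2 is the next index.
        merged.extend(fastq_to_fasta(fastq_lines, sid, start_index=len(merged) // 2))
    return merged


if __name__ == "__main__":
    reads = ["@r1", "ACGT", "+", "IIII", "@r2", "GGCC", "+", "IIII"]
    print("\n".join(merge_samples({"S1": reads, "S2": reads[:4]})))
```

Keeping the sample ID in every header is what lets later steps count reads per sample after all samples have been pooled into a single file.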
The output from pispino_seqprep is taken as the input for this step. It is also mandatory to tell the script which ITS subregion (i.e. ITS1 or ITS2) is to be extracted.
EXPLANATION: the input file (indicated with "-i") is the resulting file from the previous step
pipits_funits -i out_seqprep/prepped.fasta -o out_funits -x ITS2
- The selected subregion is extracted with ITSx and, where necessary, sequences are re-orientated to the 5' to 3' direction. It is worth noting that ITSx uses HMMER3 (Mistry et al., 2013) to compare input sequences against a set of models built from a number of different subregions of ITS sequences found in various organisms. This makes ITSx an ideal tool for both extraction of desired ITS subregions as well as filtering for specific groups of organisms. It also means that while PIPITS has been created with the analysis of fungal amplicons in mind, it could be adapted for the analyses of other organism groups where ITS is used as a marker by changing the ITSx settings and reference databases
- Having extracted the subregion, sequences are re-inflated to reflect their original abundances. To date, the longest sequenceable reads from the Illumina technology are 300 bp x 2 which is not sufficient to sequence both ITS1 and ITS2 and to have an overlapping region to join them. For this reason the program supports only a single subregion extraction mode
- PIPITS will include those sequences that do not have any conserved region detected. This is so that ALL sequences are taken into account.
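Re-inflation implies that identical reads are collapsed before extraction and expanded again afterwards. The sketch below shows that general idea under stated assumptions: it is not the PIPITS code, the function names are hypothetical, and the dictionary of extracted subregions stands in for the output of ITSx.

```python
def dereplicate(reads):
    """Collapse identical sequences: map each unique sequence to its read IDs."""
    groups = {}
    for read_id, seq in reads:
        groups.setdefault(seq.upper(), []).append(read_id)
    return groups


def reinflate(groups, extracted):
    """Expand per-unique-sequence results back to one entry per original read.

    `extracted` maps a unique input sequence to its extracted subregion;
    sequences missing from it (or mapped to None) are dropped in this sketch.
    """
    inflated = []
    for seq, read_ids in groups.items():
        region = extracted.get(seq)
        if region is None:
            continue
        inflated.extend((read_id, region) for read_id in read_ids)
    return inflated


if __name__ == "__main__":
    reads = [("r1", "AACCGGTT"), ("r2", "AACCGGTT"), ("r3", "TTGGCCAA")]
    groups = dereplicate(reads)
    # Stand-in for subregion extraction: trim two bases from each end.
    extracted = {seq: seq[2:-2] for seq in groups}
    print(reinflate(groups, extracted))
```

The expensive extraction step then runs once per unique sequence, while the abundances seen by the downstream clustering stay the same as in the original data.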
EXPLANATION: This is the final step involving clustering and assigning of taxonomy.
pipits_process -i out_funits/ITS.fasta -o out_process
- Input sequences are dereplicated
- Short (< 100 bp) and unique (singleton) sequences are removed
- The sequences are clustered at 97% PID
- The resulting representative sequences for each cluster are subjected to chimera detection and removal
- The input sequences are mapped onto the chimera-free representative sequences at 97% PID
- The representatives are taxonomically assigned with RDP Classifier against the UNITE fungal ITS reference dataset
- The results are translated into two types of OTU abundance tables:
- "OTU abundance table", where an OTU is defined as a cluster of reads with the user-defined threshold, typically 97% sequence identity, motivated by the expectation that these correspond approximately to species.
- "phylotype abundance table", where an OTU is defined as a cluster of sequences binned into the same taxonomic assignments.
- If you have memory issues, try increasing the maximum memory with "--Xmx". For example, "--Xmx 4G".
- Once everything has finished, you can leave the Conda environment by typing
conda deactivate
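The difference between the two tables can be made concrete with a small sketch of how a phylotype table might be derived from an OTU table. This is an illustration only, not the code behind pipits_process or pipits_phylotype_biom: the function name, the in-memory table layout and the "; "-separated taxonomy strings are assumptions.

```python
def phylotype_table(otu_counts, otu_taxonomy, level=None):
    """Sum per-sample OTU counts over OTUs sharing a taxonomic assignment.

    otu_counts:   OTU ID -> list of counts, one per sample
    otu_taxonomy: OTU ID -> taxonomy string such as "k__Fungi; p__Ascomycota"
    level:        number of leading ranks to keep (None keeps the full string)
    """
    table = {}
    for otu_id, counts in otu_counts.items():
        ranks = [rank.strip() for rank in otu_taxonomy[otu_id].split(";")]
        key = "; ".join(ranks[:level] if level else ranks)
        row = table.setdefault(key, [0] * len(counts))
        for i, count in enumerate(counts):
            row[i] += count
    return table


if __name__ == "__main__":
    counts = {"OTU1": [5, 0], "OTU2": [1, 2], "OTU3": [0, 7]}
    taxonomy = {
        "OTU1": "k__Fungi; p__Ascomycota",
        "OTU2": "k__Fungi; p__Ascomycota",
        "OTU3": "k__Fungi; p__Basidiomycota",
    }
    print(phylotype_table(counts, taxonomy))
    print(phylotype_table(counts, taxonomy, level=1))
```

Truncating the taxonomy at a shallower rank merges more OTUs into each phylotype, which is why the aggregation level matters when comparing tables.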
You can tweak parameters and there are several options for each of the above steps. To view them, type "-h" after each command.
pispino_seqprep -h
Run pipits_funguild.py on the resulting OTU table to have a reformatted version for FUNGuild analysis. See their page for more detail.
pipits_funguild.py -i out_process/otu_table.txt -o out_process/otu_table_funguild.txt
Please cite:
Hyun S. Gweon, Anna Oliver, Joanne Taylor, Tim Booth, Melanie Gibbs, Daniel S. Read, Robert I. Griffiths and Karsten Schonrogge, PIPITS: an automated pipeline for analyses of fungal internal transcribed spacer sequences from the Illumina sequencing platform, Methods in Ecology and Evolution, DOI: 10.1111/2041-210X.12399