Assessment of metabarcoding data pipelines: The goal of this study is to compare the performance of several bioinformatic tools for analyzing fish eDNA data. Six analysis steps were identified, from the assembly of paired-end reads to taxonomic assignment. For each of these steps, the programs to compare were identified, as shown below:
For each step, all the programs are compared, while the start and the end of the pipeline are standardized with a reference pipeline (obitools_reference). This reference pipeline is based on OBITools, a set of Python programs designed to analyse next-generation sequencing (Illumina) outputs in the context of DNA metabarcoding.
The resulting optimal pipeline will then be compared to existing complete pipelines (QIIME2 and BARQUE).
The comparison is made first on simulated data (benchmark_simulated_dataset), so that the ability of each program to retrieve the taxonomic composition of the samples can be assessed. The comparison is then verified on a real dataset (benchmark_real_dataset), composed of samples from the Carry-le-Rouet reserve, France.
To install all the programs used in this study, please follow the instructions on their installation pages: OBITools, VSEARCH, PEAR, FLASH, CASPER, fastq-join, NGmerge, fastp, cutadapt, Prinseq, SWARM and SINTAX.
The installation guidelines for the complete pipelines can be found here: QIIME2 and BARQUE.
Another possibility: all the programs have been installed in Singularity containers, which you can access as follows.
First, you need to install Singularity.
We provide ready-to-run versions of the Singularity containers.
Our complete collection of Singularity recipes is available here.
Download the containers into 99_utils/containers/:
- OBITools container:
singularity pull --name obitools.simg shub://Grelot/bioinfo_singularity_recipes:obitools
- container with VSEARCH, PEAR, FLASH, CASPER, cutadapt, fastq-join, Tally, Prinseq, USEARCH (SINTAX), SWARM, Flexbar and Python 2:
singularity pull --name ednatools.simg shub://Grelot/bioinfo_singularity_recipes:ednatools
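Once a container has been pulled, any of the bundled programs can be run through it with singularity exec. For example (paths and calls shown for illustration only):
singularity exec 99_utils/containers/ednatools.simg vsearch --version
singularity exec 99_utils/containers/obitools.simg obigrep --help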
Input FASTQ files for the simulated dataset and the real dataset, along with the reference database files needed for assignment with OBITools, are too large to be stored here and will be deposited on DataDryad upon publication. Information will be given here.
The sample description file, containing the primers and the tags associated with each sample, is available here for the simulated dataset, and here for the real dataset.
Because the dataset is simulated, we know exactly the relative abundance of each species in each sample and replicate. These data can be found here and will be compared to the output of each tested pipeline to assess its efficiency.
(Note that the input FASTQ files and the abundance data will change each time you run a Grinder simulation. The files given here correspond to the Grinder simulation used to obtain the data for our program comparison.)
To assess the efficiency of each program, we measure the execution time (among other metrics).
Each time you test a different program for a given analysis step, you can record the running time, memory usage and CPU usage of the corresponding command by prefixing it with /usr/bin/time:
/usr/bin/time command [-options]
This writes a report to the standard error file, where:
%elapsed = wall-clock time in hours:minutes:seconds
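For example, a merging run with PEAR could be timed as follows (file names are placeholders; the report is redirected to a log file):
/usr/bin/time pear -f sample_R1.fastq -r sample_R2.fastq -o sample_merged 2> time_pear.log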
Other performance metrics will be calculated for each tested pipeline: sensitivity and the F-measure index will be calculated from the numbers of true positives, false positives and false negatives output by each pipeline. The relative abundances output by each pipeline are compared to the expected abundances (species_abundance_per_sample).
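For reference, the usual definitions of these metrics, with TP, FP and FN the numbers of true positives, false positives and false negatives, are:
sensitivity = TP / (TP + FN)
precision = TP / (TP + FP)
F-measure = 2 * precision * sensitivity / (precision + sensitivity)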
To run any script on an SGE cluster, use this command line:
bash 99_utils/submitjob_sge_cluster/bash2sge.sh SCRIPT.sh
then submit the generated job file with qsub.
Before running any analysis script, make sure the config file is up to date, with the right path for each file. These config files are in the 98_infos folder of each benchmark folder (here for the simulated dataset and here for the real dataset).
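As an illustration only (the variable names below are hypothetical; refer to the actual files in 98_infos), such a config file simply defines the paths used by the scripts:
# hypothetical config sketch -- adapt every path to your own installation
DATA_PATH=/path/to/benchmark_simulated_dataset/00_data        # raw FASTQ files
SAMPLE_DESC=/path/to/98_infos/sample_description.txt          # primers and tags per sample
REFDB_PATH=/path/to/reference_database                        # reference database for assignment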
For simplicity, each pipeline is split into folders corresponding to the analysis steps: 01_merging, 02_demultiplex, 03_dereplication, 04_filtering, 05_error and 06_assignation.
The first step consists of assembling the forward and reverse reads of each sequence. We tested several assemblers, each with default parameters.
01_merging contains the scripts to run each of these programs.
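As an illustration of this step (file names are placeholders; the exact options used in the benchmark are those in the 01_merging scripts), paired reads can be merged with VSEARCH like this:
vsearch --fastq_mergepairs sample_R1.fastq --reverse sample_R2.fastq --fastqout sample_merged.fastq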
Once the reads are assembled, the primers are removed (max. 2 mismatches allowed per primer). The tags are also removed (no mismatch allowed) and each read is assigned to the sample it comes from.
02_demultiplex contains the scripts to run each program used at this step.
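For illustration (file names are placeholders), one way to do this with the OBITools reference pipeline is ngsfilter, which uses the sample description file to trim primers and tags and to label each read with its sample:
ngsfilter -t sample_description.txt -u unidentified.fastq sample_merged.fastq > demultiplexed.fastq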
The reads are then dereplicated: identical reads are merged into a single sequence and the read count is kept.
All the scripts to run the different programs are in 03_dereplication.
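As a sketch of this step with OBITools (file names are placeholders), dereplication is done with obiuniq, keeping the per-sample counts:
obiuniq -m sample demultiplexed.fastq > dereplicated.fasta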
Reads are then quality-checked: only sequences longer than 20 bp and containing no ambiguous bases are kept.
The scripts to run the different programs are in 04_filtering.
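For illustration (placeholders again), such a filter can be expressed with obigrep, keeping sequences longer than 20 bp that contain only A, C, G or T:
obigrep -p 'seq_length>20' -s '^[ACGT]+$' dereplicated.fasta > filtered.fasta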
Each program or pipeline offers different tools to remove PCR or sequencing errors. For OBITools, the program obiclean keeps only the head sequences for each PCR replicate.
The scripts for the different programs are in 05_error.
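As a sketch of the OBITools version of this step (the file names and the -r ratio threshold are illustrative, not necessarily the values used in the benchmark; -H keeps only the head sequences):
obiclean -s merged_sample -r 0.05 -H filtered.fasta > cleaned.fasta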
The last step of the analysis is to assign each sequence to a taxon. In our case, we use a homemade reference database. To be assigned at the species level, the query sequence must be at least 98% similar to the reference sequence.
06_assignation contains the scripts to run the different assignment programs.
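For illustration (database names are placeholders), assignment with the OBITools reference pipeline is done with ecotag, which takes a taxonomy database (-d) and a reference FASTA file (-R):
ecotag -d reference_taxonomy_db -R reference_sequences.fasta cleaned.fasta > assigned.fasta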
The outputs of each tested pipeline can be found in the Outputs folder, inside the folder corresponding to the tested step, under the name of the tested program. The main folder contains all the intermediate files produced by the pipeline, and the final folder contains the taxonomic table.
For example, the results of the pipeline testing the program flash for read merging are in: 01_merging/Outputs/02_flash/final/merging_flash.csv
Time reports for each program compared are stored here.
The best-performing programs for each step were assembled into a single pipeline, which was compared to the other complete pipelines. The scripts for this pipeline are in benchmark_simulated_dataset/07_complete_pipelines/assembled_pipeline.
The assembled pipeline built from the best-performing individual programs is compared to other complete pipelines: BARQUE and QIIME2. The code used to run BARQUE can be found here: Barque-1.6.2, or at https://github.com/enormandeau/barque/releases/tag/v1.6.2. The code used to run QIIME2 is available here: QIIME2.
The comparison adapted to the real dataset can be found in benchmark_real_dataset, and follows the same structure as the simulated-data comparison.