Assessment of metabarcoding data pipelines: The goal of this study is to compare the performance of several bioinformatic tools for analyzing fish eDNA data. Six analysis steps were identified, from the assembly of paired-end reads to taxonomic assignment. For each of these steps, the programs to compare were identified, as shown below:
For each step, all the programs are compared, while the start and the end of the pipeline are standardized with a reference pipeline (obitools_reference). This reference pipeline is based on OBITools, a set of Python programs designed to analyse next-generation sequencing (Illumina) outputs in the context of DNA metabarcoding.
The resulting optimal pipeline will then be compared to existing complete pipelines (QIIME2 and BARQUE).
The comparison is made first on simulated data (benchmark_simulated_dataset), so that the ability of each program to retrieve the taxonomic composition of the samples can be assessed. The comparison is then verified on a real dataset (benchmark_real_dataset), composed of samples from the Carry-le-Rouet reserve, France.
To install all the programs used in this study, please follow the instructions on their installation pages: OBITools, VSEARCH, PEAR, FLASH, CASPER, fastq-join, NGmerge, fastp, cutadapt, Prinseq, SWARM and SINTAX.
The installation guidelines for the complete pipelines can be found here: QIIME2 and BARQUE.
Another possibility: all the programs have been installed in Singularity containers, which you can access as follows.
First, you need to install Singularity.
We provide ready-to-run versions of the Singularity containers.
Our complete collection of Singularity recipes is available here.
Download the containers into 99_utils/containers/:
- OBITools container:
singularity pull --name obitools.simg shub://Grelot/bioinfo_singularity_recipes:obitools
- container with VSEARCH, PEAR, FLASH, CASPER, cutadapt, fastq-join, Tally, Prinseq, USEARCH (SINTAX), SWARM, Flexbar and Python 2:
singularity pull --name ednatools.simg shub://Grelot/bioinfo_singularity_recipes:ednatools
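Once a container has been pulled, any of the bundled programs can be run through it with singularity exec. For example (paths and calls shown for illustration only):
singularity exec 99_utils/containers/ednatools.simg vsearch --version
singularity exec 99_utils/containers/obitools.simg obigrep --help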
Input FASTQ files for the simulated dataset and the real dataset, along with the reference database files needed for assignment with OBITools, are too large to be stored here and will be deposited on DataDryad upon publication. Information will be given here.
The sample description file, containing the primers and the tags associated with each sample, is available here for the simulated dataset, and here for the real dataset.
Because the dataset is simulated, we know exactly the relative abundance of each species in each sample and replicate. These data can be found here and will be compared to the output of each tested pipeline to assess its efficiency.
(Note that the input FASTQ files and the abundance data will change each time you run a Grinder simulation. The files given here correspond to the Grinder simulation used to obtain the data for our program comparison.)
To assess the efficiency of each program, we measure the execution time (among other metrics).
Each time you test a different program for a given analysis step, you can record the running time, memory usage and CPU usage of the corresponding command by prefixing it with /usr/bin/time:
/usr/bin/time command [-options]
This writes a report to the standard error file, where:
%elapsed = wall-clock time in hours:minutes:seconds
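For example, a merging run with PEAR could be timed as follows (file names are placeholders; the report is redirected to a log file):
/usr/bin/time pear -f sample_R1.fastq -r sample_R2.fastq -o sample_merged 2> time_pear.log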
Other performance metrics will be calculated for each tested pipeline: sensitivity and the F-measure index will be calculated from the numbers of true positives, false positives and false negatives output by each pipeline. The relative abundances output by each pipeline are compared to the expected abundances (species_abundance_per_sample).
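For reference, the usual definitions of these metrics, with TP, FP and FN the numbers of true positives, false positives and false negatives, are:
sensitivity = TP / (TP + FN)
precision = TP / (TP + FP)
F-measure = 2 * precision * sensitivity / (precision + sensitivity)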
To run any script on an SGE cluster, use this command line:
bash 99_utils/submitjob_sge_cluster/bash2sge.sh SCRIPT.sh
then submit the generated job file with qsub.
Before running any analysis script, make sure the config file is up to date, with the right path for each file. These config files are in the 98_infos folder of each benchmark folder (here for the simulated dataset and here for the real dataset).
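As an illustration only (the variable names below are hypothetical; refer to the actual files in 98_infos), such a config file simply defines the paths used by the scripts:
# hypothetical config sketch -- adapt every path to your own installation
DATA_PATH=/path/to/benchmark_simulated_dataset/00_data        # raw FASTQ files
SAMPLE_DESC=/path/to/98_infos/sample_description.txt          # primers and tags per sample
REFDB_PATH=/path/to/reference_database                        # reference database for assignment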
For simplicity, each pipeline is split into folders corresponding to the analysis steps: 01_merging, 02_demultiplex, 03_dereplication, 04_filtering, 05_error and 06_assignation.
The first step consists of assembling the forward and reverse reads of each sequence. We tested several assemblers, each with default parameters.
01_merging contains the scripts to run each of these programs.
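As an illustration of this step (file names are placeholders; the exact options used in the benchmark are those in the 01_merging scripts), paired reads can be merged with VSEARCH like this:
vsearch --fastq_mergepairs sample_R1.fastq --reverse sample_R2.fastq --fastqout sample_merged.fastq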
Once the reads are assembled, the primers are removed (max. 2 mismatches allowed per primer). The tags are also removed (no mismatch allowed) and each read is assigned to the sample it comes from.
02_demultiplex contains the scripts to run each program used at this step.
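For illustration (file names are placeholders), one way to do this with the OBITools reference pipeline is ngsfilter, which uses the sample description file to trim primers and tags and to label each read with its sample:
ngsfilter -t sample_description.txt -u unidentified.fastq sample_merged.fastq > demultiplexed.fastq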
The reads are then dereplicated: identical reads are merged into a single sequence and the read count is kept.
All the scripts to run the different programs are in 03_dereplication.
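As a sketch of this step with OBITools (file names are placeholders), dereplication is done with obiuniq, keeping the per-sample counts:
obiuniq -m sample demultiplexed.fastq > dereplicated.fasta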
Reads are then quality-checked: only sequences longer than 20 bp and containing no ambiguous bases are kept.
The scripts to run the different programs are in 04_filtering.
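For illustration (placeholders again), such a filter can be expressed with obigrep, keeping sequences longer than 20 bp that contain only A, C, G or T:
obigrep -p 'seq_length>20' -s '^[ACGT]+$' dereplicated.fasta > filtered.fasta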
Each program or pipeline offers different tools to remove PCR or sequencing errors. For OBITools, the program obiclean keeps only the head sequences for each PCR replicate.
The scripts for the different programs are in 05_error.
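As a sketch of the OBITools version of this step (the file names and the -r ratio threshold are illustrative, not necessarily the values used in the benchmark; -H keeps only the head sequences):
obiclean -s merged_sample -r 0.05 -H filtered.fasta > cleaned.fasta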
The last step of the analysis is to assign each sequence to a taxon. In our case, we use a homemade reference database. To be assigned at the species level, the query sequence must be at least 98% similar to the reference sequence.
06_assignation contains the scripts to run the different assignment programs.
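For illustration (database names are placeholders), assignment with the OBITools reference pipeline is done with ecotag, which takes a taxonomy database (-d) and a reference FASTA file (-R):
ecotag -d reference_taxonomy_db -R reference_sequences.fasta cleaned.fasta > assigned.fasta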
The outputs of each tested pipeline can be found in the Outputs folder, inside the folder corresponding to the tested step, under the name of the tested program. The main folder contains all the intermediate files produced by the pipeline, and the final folder contains the taxonomic table.
For example, the results of the pipeline testing the program flash for read merging are in: 01_merging/Outputs/02_flash/final/merging_flash.csv
Time reports for each program compared are stored here.
The best-performing programs for each step were assembled into a single pipeline, which was compared to the other complete pipelines. The scripts for this pipeline are in benchmark_simulated_dataset/07_complete_pipelines/assembled_pipeline.
The assembled pipeline built from the best-performing individual programs is compared to other complete pipelines: BARQUE and QIIME2. The code used to run BARQUE can be found here: Barque-1.6.2, or at https://github.com/enormandeau/barque/releases/tag/v1.6.2. The code used to run QIIME2 is available here: QIIME2.
The comparison adapted to the real dataset can be found in benchmark_real_dataset, and follows the same structure as the simulated-data comparison.