This pipeline performs basic quality control on a collection of barcode sequences for metabarcoding experiments: taxid-wise dereplication, Hamming distance calculation, and sequence length distribution.
BAnalyzer runs in a UNIX environment with BASH (tested on Debian GNU/Linux 10 (buster)) and requires conda and an internet connection (at least for the first run).
Start by getting a copy of this repository on your system, either by downloading and unpacking the archive, or using 'git clone':
cd path/to/repo/
git clone --recurse-submodules https://github.com/CVUA-RRW/BAnalyzer.git
Set up a conda environment containing Snakemake (version 6 is not supported), Python, and the pandas library, then activate it:
conda create --name snakemake -c bioconda -c anaconda "snakemake>=5.10,<6.0" pandas
conda activate snakemake
To run the pipeline you will need to provide a BLAST-formatted reference sequence database. If you already have a FASTA file with your sequences, follow the BLAST documentation to format it.
If you want to extract barcodes from a database of reference genomes you can check out our RRW-PrimerBLAST pipeline.
You will also need to provide the taxdb and taxdump files available from the NCBI server.
BAnalyzer should be run using the snakemake command-line application. To do so, fill in the config.yaml file with the paths to the required files. You can also modify the parameters already present in the file.
Then run the pipeline with:
snakemake -s /path/to/BAnalyzer/Snakefile --configfile path/to/config.yaml --use-conda --conda-prefix path/to/your/conda/envs
Consult snakemake's documentation for more details.
The configuration file contains the following parameters:
# Fill in the path below with your own specifications:
workdir: # Path to output directory
blast_db: # Path to BLAST-formatted database
taxdb: # Path to the folder containing the taxdb files
rankedlineage_dmp: # Path to rankedlineage.dmp
nodes_dmp: # Path to nodes.dmp
# Modify the parameters below:
trim_primers: False # True to trim primers from sequences
primers: None # Path to the fasta file containing primer sequences, required only if trim_primers is True
min_identity: 0.9 # Minimal identity level to compute Hamming distance (real between 0 and 1)
max_n: 3 # Maximum number of N nucleotides allowed per sequence, sequences with more will be discarded
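The constraints described in the comments above can be checked before launching a run. The following is only a minimal sketch, not part of BAnalyzer: `check_config` and `REQUIRED_PATHS` are hypothetical names, and the function expects the configuration already loaded into a Python dictionary (for instance with a YAML parser). Note that a bare `None` in YAML is read as the string "None", which the sketch treats as an unset value.

```python
# Hypothetical helper, not shipped with BAnalyzer: sanity-check a loaded config.
REQUIRED_PATHS = ("workdir", "blast_db", "taxdb", "rankedlineage_dmp", "nodes_dmp")


def check_config(cfg):
    """Return a list of human-readable problems found in a config mapping."""
    problems = []

    # Every path entry must be filled in.
    for key in REQUIRED_PATHS:
        if not cfg.get(key):
            problems.append(f"{key}: missing path")

    # The primer file is only required when trimming is enabled.
    if cfg.get("trim_primers") and cfg.get("primers") in (None, "", "None"):
        problems.append("primers: required when trim_primers is True")

    # min_identity is documented as a real number between 0 and 1.
    identity = cfg.get("min_identity")
    if (not isinstance(identity, (int, float)) or isinstance(identity, bool)
            or not 0 <= identity <= 1):
        problems.append("min_identity: must be a number between 0 and 1")

    # max_n is a count of N nucleotides, so it must be a non-negative integer.
    max_n = cfg.get("max_n")
    if not isinstance(max_n, int) or isinstance(max_n, bool) or max_n < 0:
        problems.append("max_n: must be a non-negative integer")

    return problems
```

An empty list means that no problem was detected; the sketch does not check that the paths actually exist on disk.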
If you choose to trim primer sequences from the barcodes, both the original and trimmed lengths are shown in the report, but the Hamming distance is calculated only on the trimmed sequences.
Sequences within a taxonomic node are clustered prior to alignment and calculation of the Hamming distance, which reduces redundancy in the database. Dereplication is performed by grouping sequences with 100% identity within a taxonomic node. Note that aligning ambiguous nucleotides never incurs a penalty. Sequences containing ambiguous nucleotides can therefore be clustered with sequences that contain only strict nucleotides (A, T, U, C, G).
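To illustrate the effect of ambiguous nucleotides on dereplication, here is a minimal sketch in plain Python. It is not the pipeline's actual implementation: `hamming` and `dereplicate` are hypothetical names, sequences are assumed to be already aligned and of equal length, and clustering is done greedily against the first sequence of each cluster. Because matching with ambiguous positions is not transitive, the greedy result can depend on input order.

```python
from collections import defaultdict

# Strict nucleotides; U is folded into T before comparison.
STRICT = set("ACGT")


def hamming(seq1, seq2):
    """Count mismatches, ignoring any position with an ambiguous nucleotide."""
    s1 = seq1.upper().replace("U", "T")
    s2 = seq2.upper().replace("U", "T")
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    # A position is penalized only if both symbols are strict and differ.
    return sum(1 for a, b in zip(s1, s2)
               if a in STRICT and b in STRICT and a != b)


def dereplicate(records):
    """Group (taxid, seq_id, sequence) records into clusters per taxid.

    Returns a dict mapping each taxid to a list of clusters, each cluster
    being a list of sequence ids with distance 0 to its representative.
    """
    clusters = defaultdict(list)  # taxid -> list of (representative, [ids])
    for taxid, seq_id, seq in records:
        for representative, members in clusters[taxid]:
            if len(representative) == len(seq) and hamming(representative, seq) == 0:
                members.append(seq_id)
                break
        else:
            # No matching cluster within this taxid: start a new one.
            clusters[taxid].append((seq, [seq_id]))
    return {taxid: [members for _, members in groups]
            for taxid, groups in clusters.items()}


records = [
    (9606, "s1", "ACGT"),
    (9606, "s2", "ACNT"),  # N never incurs a penalty, so this joins s1
    (9606, "s3", "ACTT"),  # one strict mismatch with s1, so a new cluster
    (9913, "s4", "ACGT"),  # identical to s1 but in another taxonomic node
]
result = dereplicate(records)
```

In this example `s1` and `s2` end up in the same cluster, while `s4` stays separate because dereplication never crosses taxonomic nodes.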
The pipeline produces an HTML report located in workdir/reports as well as several CSV files that can be used programmatically for further analysis.
BAnalyzer is built with Snakemake and uses:
For new features or to report bugs please submit issues directly on the online repository.
This project is licensed under a BSD 3-Clause License, see the LICENSE file for details.
For questions about the pipeline, problems, suggestions or requests, feel free to contact:
Grégoire Denay, Chemisches- und Veterinär-Untersuchungsamt Rhein-Ruhr-Wupper