Microbial Genome Assembly Pipeline

Synopsis

This pipeline runs assembly and post-assembly steps on Illumina paired-end (PE) or single-end (SE) reads. The steps are: downsample reads with Seqtk/Mash (optional), clean reads with Trimmomatic, assemble the clean reads with SPAdes, perform post-assembly correction with Pilon, and annotate the finished assemblies with Prokka.

Installation

The pipeline can be set up in two easy steps:

Clone the GitHub repository onto your system.

git clone https://github.com/alipirani88/assemblage.git

Use the assemblage/assemblage.yml and assemblage/assemblage_report.yml files to create two new conda environments, assemblage and assemblage_report:

conda env create -f assemblage/assemblage.yml -n assemblage
conda env create -f assemblage/assemblage_report.yml -n assemblage_report

Check installation

conda activate assemblage

python assemblage/assemblage.py -h

Input

Input is a directory (-readsdir) containing SE/PE reads and a config file that holds all the configuration settings for the pipeline. The settings in this config file are applied to every sample in the reads directory. An example config file with default parameters is included in the pipeline folder. You can customize this config file and provide it with the -config argument.

For details, see the section Customizing Config file.

Note: Apart from the standard MiSeq/HiSeq fastq naming extension (R1_001_final.fastq.gz), other acceptable fastq extensions are: R1.fastq.gz/_R1.fastq.gz, 1_combine.fastq.gz, 1_sequence.fastq.gz, _forward.fastq.gz, _1.fastq.gz/.1.fastq.gz.
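To check in advance whether your file names will be recognised, the suffix list in the note above can be turned into a small test. This is only an illustrative sketch: the helper names is_forward_read and list_forward_reads are hypothetical and are not part of the pipeline, which may match file names differently.

```python
from pathlib import Path

# Forward-read suffixes copied from the note above.
FORWARD_SUFFIXES = (
    "R1_001_final.fastq.gz",
    "R1.fastq.gz",
    "_R1.fastq.gz",
    "1_combine.fastq.gz",
    "1_sequence.fastq.gz",
    "_forward.fastq.gz",
    "_1.fastq.gz",
    ".1.fastq.gz",
)


def is_forward_read(filename):
    """Return True if the file name ends with one of the listed suffixes."""
    return filename.endswith(FORWARD_SUFFIXES)


def list_forward_reads(reads_dir):
    """Return the sorted names of forward-read files found in reads_dir."""
    return sorted(
        p.name for p in Path(reads_dir).iterdir() if is_forward_read(p.name)
    )
```

Calling list_forward_reads("/Path-to/Reads-dir/") before submitting jobs gives a quick view of which samples would be picked up under these assumptions.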

Quick Start

Generate and run assembly jobs for a set of PE reads.


python assemblage/assemblage.py -dir /Path-to/Reads-dir/ -out_dir /Path-to/output-dir/ -type PE -email username@umich.edu -resources "--nodes=1 --ntasks=1 --cpus-per-task=1 --mem=5g --time=50:00:00" -scheduler SLURM -coverage_depth 150 -downsample yes -config assemblage/config

The above command will generate and run assembly jobs for a set of PE reads residing in Reads-dir. The results will be saved in the output directory output-dir.
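The -coverage_depth and -downsample options suggest that reads are subsampled toward a target depth. This README does not describe how the pipeline computes that, so the arithmetic below is a generic sketch and not the pipeline's own code. It assumes that you already have an estimate of the genome size and the total number of sequenced bases; the function name downsample_fraction is hypothetical.

```python
def downsample_fraction(total_bases, genome_size, target_depth):
    """Fraction of reads to keep so that the expected depth is about target_depth.

    total_bases  : total number of bases across all reads of a sample
    genome_size  : estimated genome size in bases (an assumed input)
    target_depth : desired coverage depth, e.g. 150
    """
    if total_bases <= 0 or genome_size <= 0 or target_depth <= 0:
        raise ValueError("all inputs must be positive")
    current_depth = total_bases / genome_size
    if current_depth <= target_depth:
        # Already at or below the target depth: keep every read.
        return 1.0
    return target_depth / current_depth


# A 5 Mb genome sequenced to 1.5 Gb is at 300x, so half the reads give 150x.
fraction = downsample_fraction(1_500_000_000, 5_000_000, 150)
```

Samples that are already below the target depth are left untouched in this sketch, which is the usual expectation for depth-capped downsampling.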

Optional: the config file contains options for some frequently used reference genomes to use with ABACAS for reordering. To see which reference genomes are included, look in the config file or check the help menu of the pipeline.
The assembly for each sample is placed in its own folder inside the output directory.
A log file is generated for each sample and can be found in that sample's folder inside the output directory. A single log file for this step is also generated in the main output directory. For more information on the log file prefix and naming convention, refer to the Log section below.

Gather the assembly results and generate a MultiQC report.


conda activate assemblage_report

python assemblage/report.py -out_dir /Path-to/output-dir/

Output

The report script will gather the assembly fasta files in the Results/YYYY-MM-DD_assembly and YYYY-MM-DD_plasmid_assembly folders. It will then run QUAST and MultiQC and write the results to /Report/Quast/.
Results for each sample can be found in its own individual folder. Each sample folder will contain the assembly fasta files with the suffixes _l500_contigs.fasta and _l500_plasmid_contigs.fasta.
Prokka results can be found in the _prokka directory.
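To confirm that every sample produced an assembly, the suffixes above can be used to scan the output directory. The sketch below assumes that sample folders sit directly under the output directory and that the fasta files sit directly inside each sample folder. The function name find_assemblies is hypothetical and not part of the pipeline.

```python
from pathlib import Path


def find_assemblies(out_dir):
    """Map each sample folder name to the assembly fasta file names found in it."""
    found = {}
    for sample_dir in sorted(Path(out_dir).iterdir()):
        if not sample_dir.is_dir():
            continue
        # The two suffixes are distinct, so the patterns do not overlap.
        contigs = sorted(p.name for p in sample_dir.glob("*_l500_contigs.fasta"))
        plasmids = sorted(
            p.name for p in sample_dir.glob("*_l500_plasmid_contigs.fasta")
        )
        if contigs or plasmids:
            found[sample_dir.name] = {
                "contigs": contigs,
                "plasmid_contigs": plasmids,
            }
    return found
```

Folders without any matching file are simply skipped, so comparing the returned keys with your list of samples shows which assemblies are missing.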

Customizing Config file:

By default, the pipeline uses the config file that comes with it. Either edit this config file in place, or copy it to your local system, edit the copy, and provide the path to the edited file with the -config argument.


cp assemblage/config /Path-to-local/config_edit

The pipeline is configured through the config file, which can be customised to use your choice of tools and custom parameters.

If you wish to run the jobs on a cluster, make sure you change the scheduler parameters in the scheduler section shown below. For more information, visit the flux homepage.


[scheduler]
resources: nodes=1:ppn=4,pmem=4000mb,walltime=24:00:00
email: username@umich.edu
queue: XXX
flux_account: XXX
notification: a

Every tool has its own *_bin option that sets the name of the folder in which the tool resides. For example, in the Trimmomatic section below, the trimmomatic_bin option is set to /Trimmomatic/. That folder itself resides in the /nfs/esnitkin/bin_group/assembly_umich/ folder, which is set with the binbase option.

[Trimmomatic]
trimmomatic_bin: /Trimmomatic/
adaptor_filepath: adapters/TruSeq3-Nextera_PE_combined.fa
seed_mismatches: 2
palindrome_clipthreshold: 30
simple_clipthreshold: 10
minadapterlength: 8
keep_both_reads: true
window_size: 4
window_size_quality: 20
minlength: 40
headcrop_length: 0
colon: :
targetlength: 125
crop_length: 40
f_p: forward_paired.fq.gz
f_up: forward_unpaired.fq.gz
r_p: reverse_paired.fq.gz
r_up: reverse_unpaired.fq.gz

Parameters for each tool can be customised under the 'tool_parameter' attribute of that tool in the config file.

For example, to change the minadapterlength parameter of Trimmomatic from 8 to 10, replace the value 8 with 10 in the Trimmomatic section and restart the pipeline.
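The config sections above use a key: value layout that Python's standard configparser module can read, so you can inspect or override a setting programmatically. The following is a minimal sketch, not the pipeline's own loader. The section name bin_path that holds binbase is an assumption (this README does not show it), and the helper tool_directory is hypothetical.

```python
import configparser

# Trimmed-down config text. The [bin_path] section name is assumed;
# the option names and values are taken from the examples above.
SAMPLE_CONFIG = """
[bin_path]
binbase: /nfs/esnitkin/bin_group/assembly_umich/

[Trimmomatic]
trimmomatic_bin: /Trimmomatic/
minadapterlength: 8
"""


def tool_directory(config, section, bin_option, base_section="bin_path"):
    """Join binbase and a tool's *_bin folder into one directory path."""
    base = config.get(base_section, "binbase").rstrip("/")
    tool = config.get(section, bin_option).strip("/")
    return f"{base}/{tool}/"


# interpolation=None keeps any literal % characters in values untouched.
config = configparser.ConfigParser(interpolation=None)
config.read_string(SAMPLE_CONFIG)

trimmomatic_dir = tool_directory(config, "Trimmomatic", "trimmomatic_bin")

# Override minadapterlength in memory (8 -> 10); the file on disk is unchanged.
config.set("Trimmomatic", "minadapterlength", "10")
min_adapter_length = config.getint("Trimmomatic", "minadapterlength")
```

To make the change permanent, edit the config file itself, since the pipeline reads its settings from the file passed with -config.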

Log:

The pipeline generates a log file following the naming convention: yyyy_mm_dd_hrs_mins_secs_analysisname.log.txt and tracks each event/command. The log file sections follow standard Python logging conventions:

INFO to print STDOUT messages,

DEBUG to print commands run by the pipeline,

ERROR to print STDERR messages, and

EXCEPTION to print an exception that occurred while the pipeline was running.
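The naming convention and levels above can be reproduced with Python's standard logging module. This is a sketch of the convention only, not the pipeline's actual logging code. The function names, the logger name, and the message format are assumptions. Note that in the standard library, logger.exception records at the ERROR level and appends the traceback.

```python
import logging
import os
import tempfile
from datetime import datetime


def log_filename(analysis_name, now=None):
    """Build a name of the form yyyy_mm_dd_hrs_mins_secs_analysisname.log.txt."""
    now = now or datetime.now()
    return now.strftime("%Y_%m_%d_%H_%M_%S") + "_" + analysis_name + ".log.txt"


def make_logger(path, name="assemblage_example"):
    """Return a logger that writes DEBUG and higher messages to the given file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), log_filename("assembly"))
    logger = make_logger(log_path)
    logger.info("message that would otherwise go to STDOUT")
    logger.debug("command run by the pipeline")
    logger.error("message that would otherwise go to STDERR")
    try:
        raise RuntimeError("example failure")
    except RuntimeError:
        # Logged at ERROR level, with the traceback appended.
        logger.exception("exception raised while the pipeline was running")
```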
