SnakePipelines

A bioinformatics workflow to process and analyze the Next Generation Sequencing (NGS) reads

Introduction


This is a bioinformatics workflow built with Snakemake to automate the downstream processing of Next Generation Sequencing (NGS) reads. It currently supports only paired-end Illumina sequence data. The project bundles a number of snakefiles for de novo assembly, pan-genome construction, and detection of antimicrobial resistance and virulence genes. The snakefiles are briefly described below.

De novo assembly snakefile

denovoassembly.Snakefile

  • Quality assessment of raw reads using FASTQC
  • Quality trimming using SICKLE
  • SPAdes for a de novo assembly
  • Contig filtering by length and coverage
  • Taxonomic classification of fastq reads using minikraken
  • Taxonomic classification of assembled contigs using kraken
  • Calculate read coverage stats for mapped reads
  • Quality assessment of assemblies using quast
  • A final informative .html report using MULTIQC

Pan-genome snakefile

pangenome.Snakefile

  • Prokka for a rapid contig annotation
  • Roary to construct a pangenome
  • FastTree to create a ML phylogeny

To-Do

  • Replace run with shell in the rules, so that conda packages are downloaded and used
  • Allow each of the snakefiles to run independently if needed

Current limitations

  • Visualising the phylogenetic tree and plotting metadata is done outside Snakemake
  • The option --use-conda within snakemake is not currently feasible
  • The config file MUST include both fastq and fasta files

Dependencies


The workflow was used with the following versions of software

Getting Started


Set up a project folder and obtain the latest version of the workflow

  • Set up a project folder for the run

      mkdir NGS-Project
      cd NGS-Project
    
  • Download the latest version from GitLab

      git clone https://gitlab.com/Mostafa.Abdel-Glil/snakepipelines_bacterialgenomes.git
    

Generate a config file for the pipeline

The bash script generateConfigSnakemake.sh automatically generates a config file in YAML format, given a folder that holds the raw data as well as the already assembled genomes. The resulting config file lists all raw data and assembled genomes and contains the paths of databases and scripts. Some editing of the config file is required to set the database paths.

USAGE:
   bash ./Scripts/generateConfigSnakemake.sh -d DIR/ -o File -g STRING

DESCRIPTION:
   Generate Config file for Snakemake workflow in yaml format.

REQUIRED ARGUMENTS:
   -d, --directory DIR
      directory path where fastq/fasta files are stored.
   -o, --output STRING
      The output config file for Snakemake workflow in yaml format.
   -g, --genus STRING
      The name of genus for the fastq/fasta files

OPTIONAL ARGUMENTS:
   -h, --help
      Show this message.

EXAMPLE:
   bash ./Scripts/generateConfigSnakemake.sh -d ./fastqReads/ -o config.yaml -g Campylobacter
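
Because the workflow supports only paired-end data, it can help to confirm that every forward read file has a mate before generating the config. The following is a minimal sketch, not part of the repository: the function name check_pairs and the file naming pattern <sample>_R1.fastq.gz / <sample>_R2.fastq.gz are assumptions and may not match what generateConfigSnakemake.sh expects.

```shell
#!/usr/bin/env bash
# Hypothetical pre-check: report forward read files that lack a reverse mate.
# ASSUMPTION: files are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz.
check_pairs() {
    local dir="$1" missing=0 r1 r2
    for r1 in "$dir"/*_R1.fastq.gz; do
        # If nothing matched, the glob stays literal and is skipped here.
        [ -e "$r1" ] || continue
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "missing mate for: $r1" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every forward file has a mate.
    [ "$missing" -eq 0 ]
}
```

One possible use is to chain it in front of the generator, for example: check_pairs ./fastqReads/ && bash ./Scripts/generateConfigSnakemake.sh -d ./fastqReads/ -o config.yaml -g Campylobacter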

The following paths for databases should be adjusted in the config file.

DB:
  minikraken: /home/mostafa.abdel/dbs/miniKraken/minikraken_20171019_8GB
  kraken: /home/DB_RAM/KrakenDB
  krona: /home/DB/Krona_Taxonomy
AMR_db:  /data/AGr110/mostafa/ariba_fmt_dbs/ariba_Card
VF_db: /data/AGr110/mostafa/ariba_fmt_dbs/ariba_vfdb_full
micDATA: /home/mostafa.abdel/aProjects/Campylobacter/snakemakeProject/Final-Snake-Project/data/micData.txt
AMR_db_abricate: card
VF_db_abricate: vfdb
tools:
  scripts_dir: /home/mostafa.abdel/aProjects/Campylobacter/snakemakeProject/Final-Snake-Project/Scripts
  multiqc_bin: /home/mostafa.abdel/.local/bin
directories:
  snakemake_folder: /home/mostafa.abdel/aProjects/Campylobacter/snakemakeProject/Final-Snake-Project
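
The paths above belong to the author's machine, so a quick check of which ones exist locally can show what still needs editing. The sketch below is an illustration only: missing_paths is a hypothetical helper, and it relies on plain text matching rather than a YAML parser, assuming one "key: /absolute/path" entry per line with no spaces in the path.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print absolute paths from a config file that do not exist.
# ASSUMPTION: one "key: /absolute/path" per line, no spaces inside paths.
missing_paths() {
    local config="$1" status=0 path
    # Extract the value after "key:" when it starts with "/"; other lines are ignored.
    for path in $(sed -nE 's/^[^#]*:[[:space:]]+(\/[^[:space:]]+).*$/\1/p' "$config"); do
        if [ ! -e "$path" ]; then
            echo "not found: $path"
            status=1
        fi
    done
    return "$status"
}
```

Calling missing_paths config.yaml prints one line per path that is absent and returns a non-zero status if any were found. Values that are not absolute paths, such as card or vfdb, are skipped.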

Running the snake pipeline

It is always a good idea to display what the workflow will do without executing it. To do that, use the following command.

snakemake -np --quiet --snakefile master.Snakefile

Execute the commands in the pipeline by removing the -np option

snakemake --snakefile master.Snakefile
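
The two steps can be combined so that the real run starts only if the dry run succeeds. This is a sketch under assumptions rather than part of the workflow: run_pipeline and the SNAKEMAKE variable are hypothetical names, and the core count is illustrative. The --cores option is added because recent Snakemake releases require it for a real run.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: dry run first, then execute only if the dry run passed.
# SNAKEMAKE may point to a specific snakemake executable; defaults to "snakemake".
run_pipeline() {
    local snakefile="$1" cores="${2:-1}"
    # -n performs a dry run, -p prints the shell commands that would be executed.
    "${SNAKEMAKE:-snakemake}" -np --quiet --snakefile "$snakefile" || {
        echo "dry run failed; nothing was executed" >&2
        return 1
    }
    # Real execution with the requested number of cores.
    "${SNAKEMAKE:-snakemake}" --cores "$cores" --snakefile "$snakefile"
}
```

For example, run_pipeline master.Snakefile 4 would check the workflow and then run it on four cores.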

Contact


Comments should be addressed to Mostafa.Abdel-Glil@fli.de