Simple workflow to map WGS data to a reference genome according to GATK best practices
This Snakemake workflow produces `.bam` files from Illumina NGS `.fastq` files according to GATK best practices. It is designed to align WGS data from paired-end sequencing runs. In addition, a coverage analysis is performed and several alignment statistics as well as a FastQC report are generated.
To run this workflow, the following input files and resources need to be prepared:
- Add all sample ids to `samples.tsv` in the column `sample`.
- Add all sample data information to `units.tsv`. Each row represents a `.fastq` file pair with corresponding forward and reverse reads. Also indicate the sample id, run id and lane number.
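For illustration, the two sample sheets might look like the sketch below; the `sample` column is named in the step above, while the `units.tsv` column names and file names are assumptions:

```
# samples.tsv -- one sample id per row in the column `sample`
sample
sampleA

# units.tsv -- column names are hypothetical; each row must carry the
# sample id, run id, lane number and the paired .fastq files
sample   run    lane   fq1                        fq2
sampleA  run1   L001   sampleA_L001_R1.fastq.gz   sampleA_L001_R2.fastq.gz
```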
- You need a reference `.fasta` file to map your reads to. For the different tools to work, you also need to prepare index files and a `.dict` file.
- The required files for the human reference genome GRCh38 can be downloaded from Google Cloud. The download can be done manually in the browser or with `gsutil` on the command line:
```
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta /path/to/download/dir/
```
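The matching `.fai` index and `.dict` file can usually be fetched from the same bucket; the file names below are assumptions based on that bucket's standard layout and should be verified:

```
# companion files for the GRCh38 fasta (file names are assumptions)
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta.fai /path/to/download/dir/
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dict /path/to/download/dir/
```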
- If those resources are not available for your reference, you may generate them yourself:
```
# create the BWA index files
bwa index /path/to/reference.fasta
# create the .fai index
samtools faidx /path/to/reference.fasta
# create the sequence dictionary (.dict)
gatk CreateSequenceDictionary -R /path/to/reference.fasta -O /path/to/reference.dict
```
- For the task `BaseRecalibrator` we use a `.vcf` file containing known indels. For GRCh38, this is available on Google Cloud as well (see the download example below this list).
- To generate a WGS metrics report, the task `CollectWgsMetrics` requires an intervals file, which is also available on Google Cloud.
- For parallel processing, the `.bam` file is split by chromosome based on a `.txt` file containing a list of chromosome ids (see the sketch below this list). Chromosomes which are not included in this file will be excluded downstream!
- Add the paths of the different files to the `config.yaml` (a sketch follows below). Index files and the `.dict` file should be in the same directory as the reference `.fasta`.
- Make sure that adapter sequences and docker container versions are correct.
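As a sketch of the known-indels download, the Mills and 1000G gold standard indels file commonly used for GRCh38 lives in the same bucket as the reference; the exact file names are assumptions and should be verified against the bucket contents:

```
# known indels .vcf for BaseRecalibrator plus its index
# (file names are assumptions, check the bucket contents)
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz /path/to/download/dir/
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi /path/to/download/dir/
```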
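The chromosome list is a plain `.txt` file with one chromosome id per line. A minimal sketch for GRCh38-style contig names (the file name is hypothetical):

```
# chromosomes.txt -- contigs not listed here are excluded downstream
chr1
chr2
chr3
...
chrX
chrY
```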
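A minimal sketch of the corresponding `config.yaml` entries; all key names here are assumptions, so match them to the keys actually used in this repository:

```
# config.yaml -- key names are hypothetical
reference: /path/to/reference.fasta         # index files and .dict in the same dir
known_indels: /path/to/known_indels.vcf.gz  # used by BaseRecalibrator
wgs_intervals: /path/to/wgs.interval_list   # used by CollectWgsMetrics
chromosomes: /path/to/chromosomes.txt       # chromosome ids for parallelization
samples: samples.tsv
units: units.tsv
```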
The workflow repository contains a small test dataset under `.tests/integration`, which can be run like so:
```
cd .tests/integration
snakemake -s ../../Snakefile -j1 --use-singularity
```
The workflow is designed for WGS data, i.e. huge datasets which require a lot of compute power. For HPC clusters, it is recommended to use a cluster profile and run something like:
```
snakemake -s /path/to/Snakefile --profile my-awesome-profile
```
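A cluster profile is a directory (for example `~/.config/snakemake/my-awesome-profile/`) containing a `config.yaml` whose keys mirror Snakemake's command line options. A minimal sketch for a SLURM cluster, with illustrative values only:

```
# ~/.config/snakemake/my-awesome-profile/config.yaml (values are assumptions)
jobs: 100
use-singularity: true
cluster: "sbatch --cpus-per-task={threads} --mem={resources.mem_mb}"
```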