WGSdownstreamQC-snakemake

workflow for downstream QC on WGS data

Post-variant-calling QC of the WGS pipeline

The pipeline performs QC on a complete WGS dataset, at both the variant and the sample level.

Variant-related QC and metrics:

  • VQSR threshold reapplication
  • HWE
  • Heterozygosity rate
  • Call rate
  • Allele Frequency comparison with outbred populations

For a subset of individuals (all the INGI-FVG samples), the following information is also extracted:

  • Allele Frequency comparison with available snp array data
  • Non Reference Discordance (NRD) comparison with available snp array data

Sample-related QC and metrics:

  • Singleton distribution
  • Missingness
  • Coverage
  • Heterozygosity rate
  • PCA
  • PCA with 1000G

The pipeline generates lists of samples and sites flagged for removal, to be used when producing the final dataset. Only sites exceeding the user-specified thresholds for HWE p-value and missing rate are removed automatically.
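
For orientation only, this kind of hard filter can be expressed with standard bcftools commands; the snippet below is just an illustration with hypothetical thresholds, not the pipeline's actual rule:

# illustration: annotate HWE and missingness, then drop sites exceeding hypothetical thresholds
bcftools +fill-tags input.vcf.gz -Ou -- -t HWE,F_MISSING | \
  bcftools view -e 'INFO/HWE < 1e-8 || INFO/F_MISSING > 0.05' -Oz -o filtered.vcf.gz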

The pipeline will generate the following plots, to help data interpretation:

  • Singletons vs NRD
  • Allele Frequency comparison with outbred pop (all variants)
  • Allele Frequency comparison with outbred pop (most diverse variants)
  • Allele Frequency comparison with SNP array data (all variants)
  • Allele Frequency comparison with SNP array data (most diverse variants)
  • Read Depth (DP) vs Singletons
  • Het Rate distribution by sample
  • PCA
  • PCA with 1000G

The final data release has to be generated manually, excluding samples and sites after evaluation of the generated summaries (tables and plots).


Setting things up

In order to run the pipeline, there are some requirements to fulfil and some setup needs to be performed. In its current version, the pipeline is tested and configured to run on the ORFEO cluster. It is possible to run the pipeline on the Apollo cluster, but this requires manually specifying the location of all software binaries in the provided config file.

Required Software

The following software has to be installed system-wide, in a user-defined Conda environment, or via the module system (ORFEO cluster).

At the moment, the load directive of each module on the ORFEO cluster is hard-coded in the pipeline code, in the form "module-name/version". Modules and versions used are listed below (an example load command follows the list):

  • bcftools/1.14
  • vcftools/0.1.16
  • plink/1.90
  • R/4.0.3
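
For example, loading these modules manually in an interactive ORFEO session would look like the commands below; during a pipeline run they are loaded automatically when the --use-envmodules option is passed, as in the execution examples further down:

module load bcftools/1.14
module load vcftools/0.1.16
module load plink/1.90
module load R/4.0.3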

Before switching to a new version of each software/module, a test run should be performed to check that the expected output files are generated and that they are consistent with the previous production version.

Required Python packages

In order to run the pipeline, the following Python packages have to be available in your Conda environment or system-wide (an example install command is shown after the list):

  • errno
  • gzip
  • io
  • multiprocessing
  • os
  • pandas
  • pathlib
  • psutil
  • re
  • sys
  • matplotlib
  • numpy
  • collections
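
Most of the modules listed above (errno, gzip, io, multiprocessing, os, pathlib, re, sys, collections) are part of the Python standard library; only the third-party packages need to be installed explicitly, for example:

conda install -c conda-forge pandas psutil matplotlib numpy
# or, equivalently, with pip:
pip install pandas psutil matplotlib numpy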

ORFEO/general setup

  1. Install Snakemake via conda (link);
    conda create -c conda-forge -c bioconda -n snakemake snakemake pandas
  2. Activate the environment you created
    conda activate snakemake

Apollo setup

  1. Add the global snakemake environment to your environment list:

    conda config --append envs_dirs /shared/software/conda/envs
    conda config --prepend envs_dirs ~/.conda/envs
  2. Check that the environment is available to you (you should see an entry "snakemake_g" in the list)

    conda env list
  3. Load the environment

    conda activate snakemake_g

Resources setup

In order to run the pipeline, some resources have to be available on your cluster/environment.

Genome reference

At the moment, all needed resources are already available on the Apollo cluster, at the following locations:

################### REFERENCES  #####################################
#paths and subfolders containing data we need for the pipeline to run
references:
  basepath: "/shared/resources"
  provider: "hgRef"
  release: "GRCh38.p13"

genome_fasta: "GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna"
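
Presumably (this is an assumption based on the field names, not stated explicitly in the config comments), the pipeline assembles the full FASTA path by joining these fields, i.e. something like:

/shared/resources/hgRef/GRCh38.p13/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna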

Outbred population references

In order to perform the allele frequency comparison with external populations, two resources are required; their paths have to be specified in the RULES block of the config file, with the parameters "1000G_subpop" and "TGP2504":

1000G_subpop: "/storage/burlo/cocca/resources/1000GP_phase3_3202/vcf/EUR/EUR_normIndel_multiSplitted.vcf.gz.tab" #full path of the file containing AF information for comparison with the study population. This file should be defined according to the main ancestry represented in your population
TGP2504: "/storage/burlo/cocca/resources/1000GP_phase3_3202/vcf/TGP_2504/TGP2504_normIndel_multiSplitted.vcf.gz.tab" #full path of the file containing AF information for comparison with the study population

The examples provided here refer to resources available on the ORFEO cluster. The main path for resources on the Apollo cluster is:

/shared/resources

PCA resources

PCA analysis is performed using the KING software. In order to perform the analysis, the user needs to define the correct resource path using the parameter 1000G_ref_for_king in the config file.

The default value below is set to work on the ORFEO cluster:

1000G_ref_for_king: "/storage/burlo/cocca/resources/1000GP_phase3/king/KGref.bed"

The corresponding file on the Apollo cluster is located at:

/shared/resources/1000G/KGref.bed

SNP array data

In order to perform the Non Reference Discordance calculation, the user needs to provide the pipeline with a VCF file that has already been formatted appropriately (correct genome reference build, correct REF/ALT alignment). The absolute path to this file has to be provided with the parameter snp_array_data in the PATH block of the config file.

This comparison is useful to detect sample contamination and genotyping errors by checking the calls against a more trusted dataset, so it is advisable to perform it even if SNP array data is available only for a subset of samples. NRD is, roughly, the fraction of discordant genotypes once concordant homozygous-reference calls are excluded from the comparison. If no array data is available for any sample in the current callset, the parameter in the config file should be left at the default value "NONE", in order to disable this comparison.

Other requirements

A manifest file following the template provided in the resource folder is required; it must be space-separated, with the header line:

SAMPLE_ID COHORT sex SEQ

The absolute path of this file has to be specified with the parameter "manifest_table" in the PATH section of the config file.
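
For illustration only, a manifest might look like the lines below (sample IDs, cohort labels, and SEQ values are hypothetical placeholders; sex is coded 1=Male, 2=Female, as noted in the config file comments):

SAMPLE_ID COHORT sex SEQ
SAMPLE_001 FVG 1 batch1
SAMPLE_002 FVG 2 batch1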


Config file setup

A config file template with default values is provided, named config.yaml. The user has to provide some information to the pipeline in order to retrieve the input files and define the output location.

Project name

It is advisable to provide a meaningful project name, since this name will be used as a prefix when naming the pipeline outputs.

########### INPUT DEFINITION ####################
proj_name : "WGS_releaseQC" #name of the project to be used also as suffix for file naming in the final concat rule

Chromosome definition

The user can specify the chromosomes to be processed. The X chromosome coding is "chrX". During the pipeline run, some rules will work only on AUTOSOMAL chromosomes.

#################################################################################
chrs: ["chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19","chr20","chr21","chr22","chrX"]
############################################

Rules parameters

This section of the config file contains parameters that should be modified only if the relevant rules or tools change because of a new software version. In this section the user will find the parameters "1000G_subpop" and "TGP2504", to be set according to the available resources, as described above.

Moreover, in this section the user will find two parameters:

  • hwe_thr
  • missing_thr

that are set to default values and are used to perform the first hard filter on the dataset. These parameters can be changed in case more stringent values are required.

In addition, the following parameter

  • exc_het_pval_thr

can be set to a more stringent value; this will flag more variants for exclusion due to the excess-of-heterozygosity p-value (more details here; at this stage, this value is calculated by bcftools).
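
For reference, these thresholds are plain entries in the RULES block of the config file; the values below are purely illustrative (check config.yaml for the shipped defaults):

hwe_thr: 1e-08           # hypothetical value: HWE p-value threshold for the first hard filter
missing_thr: 0.05        # hypothetical value: missing-rate threshold for the first hard filter
exc_het_pval_thr: 1e-06  # hypothetical value: excess-heterozygosity p-value used to flag variants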

Paths definition

The bottom section of the config file allows the user to specify input and output file paths, as well as the path of each tool used.

Regarding output paths, if a relative path is provided, all folders will be generated in a path relative to the current pipeline execution folder.

################ PATHS #################
### - INPUT and OUTPUT FILES PATHS - ###
paths:
  input_vcf: "input_file_path" #absolute path for the input files to be QC-ed. It should contain a single file with all samples and chromosomes
  base_out: "base_output_folder" #base folder for the output
  log_dir: "Log" #base path for log folder
  benchmark: "benchmark" #base path for benchmark folder
  tmp: "localtemp" #base path for tmp folder
  manifest_table: "" #manifest table, space-separated, with "SAMPLE_ID COHORT sex SEQ" as header. Sex should be coded as 1=Male, 2=Female
  1000G_ref_for_king: "/storage/burlo/cocca/resources/1000GP_phase3/king/KGref.bed" #files reference for king pca projection
  snp_array_data: "NONE" #path and prefix for the snp array data to be used for NRD calculations. Set to "NONE" if no other data is available for any sample in the callset.

The PATH TOOL section contains the paths of the binaries used in the pipeline. These should be changed only if the software is not available in the user's $PATH (following a custom installation, for example).

### - PATH TOOL - ###
BCFTOOLS: "bcftools"
VCFTOOLS: "vcftools"
BEDTOOLS: "bedtools"
PLINK: "plink"
KING: "king"

Outputs

The pipeline produces a set of output files that should be used to generate a clean data release. The main folder structure is:

.
├── 01.VQSR_reapply
├── 02.sites
├── 03.samples
├── 04.nrdr
├── 05.stats
└── 06.release

Each folder contains output data related to a set of rules.

01.VQSR_reapply

This folder contains files generated after the first step of VQSR threshold reapplication, split by chromosome. These files are then used in all subsequent steps of the pipeline.

02.sites

This folder contains files generated by all rules regarding variant filtering and QC.

.
├── arrayAF
├── HetRate
├── HweMissingClean
├── MissingRate
└── popAF

The folder "HweMissingClean" contains the files automatically filtered by HWE and missing rate; these files are used in subsequent steps. The folder "arrayAF" contains tables and plots comparing the WGS data with the provided SNP array data. The files with names like "WGS_ITA_PREREL_chr19_MERGED_ARRAY_af_extrDiff.txt" should be used for variant removal.

03.samples

This folder contains files generated by all rules related to sample filtering and QC.

04.nrdr

This folder contains files generated by the NRD rules, for both samples and variants.

05.stats

This folder contains stats files generated by bcftools, which can be used to check sample and site metrics before and after the first filtering step based on HWE and variant call rate.
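
For reference, these files come from standard bcftools usage; the commands below only illustrate how such stats can be produced and summarised, and are not necessarily the pipeline's exact invocations:

bcftools stats -s - filtered.vcf.gz > filtered.stats   # per-site and per-sample statistics
plot-vcfstats -p stats_plots/ filtered.stats           # optional summary plots from the stats file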

06.release

This folder contains tables that should be used to generate lists of variants and samples to exclude from the call set.


Running the pipeline

There are different ways to run the pipeline: Local mode, Cluster mode, or Single node mode.

Local mode

In Local mode, the pipeline is executed in an interactive shell session (locally or on a cluster) and all rules are treated as processes that can be run sequentially or in parallel, depending on the resources provided. One example of a Local execution is:

conda activate snakemake

base_cwd=/<USER_DEFINED_PATH>/WGS_QC
log_folder=${base_cwd}/Log
mkdir -p ${log_folder}
cd ${base_cwd}

snakefile=/<USER_DEFINED_PATH>/WGSdownstreamQC-snakemake/Snakefile
configfile=/<USER_DEFINED_PATH>/WGS_QC/WGSdownstreamQC_pipeline.yaml
cores=24

snakemake -p -r -s ${snakefile} --configfile ${configfile} --use-envmodules --keep-going --cores ${cores}

In this example we assume that 24 CPUs are available for the computation.
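
Before launching the actual run, a dry run (a standard Snakemake option, not specific to this pipeline) can be used to preview the jobs that would be executed:

snakemake -n -p -r -s ${snakefile} --configfile ${configfile}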

Cluster mode

In Cluster mode, the pipeline runs in an interactive shell (screen or tmux) and each rule is submitted as a job on the cluster. One example of a Cluster execution, on the ORFEO cluster, is:

module load conda
conda activate snakemake

base_cwd=/<USER_DEFINED_PATH>/WGS_QC
log_folder=${base_cwd}/Log
mkdir -p ${log_folder}
cd ${base_cwd}

snakefile=/<USER_DEFINED_PATH>/WGSdownstreamQC-snakemake/Snakefile
configfile=/<USER_DEFINED_PATH>/WGS_QC/WGSdownstreamQC_pipeline.yaml

log_name=WGS-QC_pipeline.log
stderr_name=WGS-QC_pipeline.err
cores=24
queue=thin

snakemake -p -r -s ${snakefile} --configfile ${configfile} --use-envmodules --keep-going --cluster "qsub -q ${queue} -V -k eod -l select=1:ncpus={threads}:mem={resources.mem_mb}mb -l walltime=96:00:00" -j ${cores} 1> ${log_name} 2> ${stderr_name}

In this example we also define the names of two additional log files, which help keep track of the pipeline execution. In this case, the -j option defines how many concurrent jobs are submitted to the cluster.

Single node mode

In Single node mode, the pipeline runs as a job on the cluster and all rules are treated as processes that can be run sequentially or in parallel, depending on the resources provided, similarly to the Local execution mode. One example of a Single node execution, on the ORFEO cluster, is:

module load conda
conda activate snakemake

base_cwd=/<USER_DEFINED_PATH>/WGS_QC
log_folder=${base_cwd}/Log
mkdir -p ${log_folder}
cd ${base_cwd}

snakefile=/<USER_DEFINED_PATH>/WGSdownstreamQC-snakemake/Snakefile
configfile=/<USER_DEFINED_PATH>/WGS_QC/WGSdownstreamQC_pipeline.yaml

log_name=WGS-QC_pipeline.log
stderr_name=WGS-QC_pipeline.err

cores=24
queue=thin
mem=750g

echo "cd ${base_cwd};module load conda;conda activate snakemake; snakemake -p -r -s ${snakefile} --configfile ${configfile} --cores ${cores} --use-envmodules --keep-going" | qsub -N snake_preprocessing -q ${queue} -V -k eod -o ${log_folder}/${log_name} -e ${log_folder}/${stderr_name} -l select=1:ncpus=${cores}:mem=${mem} -l walltime=96:00:00

In this example we select an entire cluster node on the "thin" queue of the ORFEO cluster, defining the number of CPUs (24) and the total amount of RAM required to run the pipeline (750g). We also define the names of the two additional log files, to keep track of the pipeline execution.