The pipeline performs a set of QC steps on a complete WGS dataset, covering both variants and samples.
Variant related QC and metrics:
- VQSR reapply
- HWE
- Heterozygosity rate
- Call rate
- Allele Frequency comparison with outbred populations
For a subset of individuals (all the INGI-FVG samples), the following information is also extracted:
- Allele Frequency comparison with available snp array data
- Non Reference Discordance (NRD) comparison with available snp array data
Sample related:
- Singleton distribution
- Missingness
- Coverage
- Heterozygosity rate
- PCA
- PCA with 1000G
The pipeline generates lists of samples and sites flagged for removal, to be used in producing the final dataset. Only sites exceeding the user-specified thresholds for HWE p-value and missing rate are removed automatically.
The pipeline will generate the following plots, to help data interpretation:
- Singletons vs NRD
- Allele Frequency comparison with outbred pop (all variants)
- Allele Frequency comparison with outbred pop (most diverse variants)
- Allele Frequency comparison with SNP array data (all variants)
- Allele Frequency comparison with SNP array data (most diverse variants)
- Read Depth (DP) vs Singletons
- Het Rate distribution by sample
- PCA
- PCA with 1000G
The final data release has to be generated by hand, excluding samples and sites after evaluating the generated summaries (tables and plots).
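Since this last step is manual, the following minimal sketch shows one possible way to drop flagged samples and sites with bcftools; the list files and output names are hypothetical placeholders, not files produced verbatim by the pipeline.
# Hypothetical example: exclude flagged samples and sites with bcftools.
# "samples_to_remove.txt" (one sample ID per line) and "sites_to_remove.txt"
# (tab-separated CHROM and POS) are placeholder names, not pipeline outputs.
bcftools view \
  -S ^samples_to_remove.txt \
  -T ^sites_to_remove.txt \
  -Oz -o WGS_releaseQC_final.vcf.gz \
  input_QCed.vcf.gz
bcftools index -t WGS_releaseQC_final.vcf.gz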
In order to run the pipeline, there are some requirements to fulfill and some setup needs to be performed. In its current version, the pipeline is tested and configured to run on the ORFEO cluster. It is possible to run the pipeline on the Apollo cluster, but this requires manually specifying the location of all software binaries in the provided config file.
The following software has to be installed system-wide, in a user-defined Conda environment or using the modules architecture (ORFEO cluster).
At the moment, the load directive of each module on the ORFEO cluster is hard-coded in the pipeline code, with the form "module-name/version". Modules and versions used are:
- bcftools/1.14
- vcftools/0.1.16
- plink/1.90
- R/4.0.3
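For reference, the same modules can be loaded manually in an interactive session on ORFEO (for example, to reproduce a rule's environment by hand); this is only an illustration of the "module-name/version" form, not a step of the pipeline itself.
# Manual module loading on ORFEO, matching the versions hard-coded in the rules
module load bcftools/1.14
module load vcftools/0.1.16
module load plink/1.90
module load R/4.0.3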
Before switching to a new version of each software/module, a test run should be performed to check that the expected output files are generated and that they are consistent with the previous production version.
In order to run the pipeline, the following Python packages have to be installed in your conda environment or system-wide:
- errno
- gzip
- io
- multiprocessing
- os
- pandas
- pathlib
- psutil
- re
- sys
- matplotlib
- numpy
- collections
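Note that most of the modules above (errno, gzip, io, multiprocessing, os, pathlib, re, sys, collections) ship with the Python standard library, so only pandas, numpy, matplotlib and psutil usually need to be installed. A quick sanity check (not a pipeline step) is a one-liner like the following:
# Sanity check: fail loudly if any required package is missing from the active environment
python -c "import errno, gzip, io, multiprocessing, os, pathlib, psutil, re, sys, collections, pandas, numpy, matplotlib; print('all required packages are importable')"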
- Install Snakemake via conda (see the Snakemake documentation):
conda create -c conda-forge -c bioconda -n snakemake snakemake pandas
- Activate the environment you created:
conda activate snakemake
- Add the global snakemake environment to your environment list:
conda config --append envs_dirs /shared/software/conda/envs
conda config --prepend envs_dirs ~/.conda/envs
- Check that the environment is available to you (you should see an entry "snakemake_g" in the list):
conda env list
- Load the environment:
conda activate snakemake_g
In order to run the pipeline, some resources have to be available on your cluster/environment.
At the moment, all the needed resources are already available on the Apollo cluster, at the following locations:
################### REFERENCES #####################################
#paths and subfolders containing data we need for the pipeline to run
references:
  basepath: "/shared/resources"
  provider: "hgRef"
  release: "GRCh38.p13"
  genome_fasta: "GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna"
In order to perform the allele frequency comparison with external populations, two resources are required, and their paths have to be specified in the RULES block of the config file, with the parameters "1000G_subpop" and "TGP2504":
1000G_subpop: "/storage/burlo/cocca/resources/1000GP_phase3_3202/vcf/EUR/EUR_normIndel_multiSplitted.vcf.gz.tab" #full path of the file containing AF information for fomparison with the study population. This file should be defined according to the main ancestry represented in your population
TGP2504: "/storage/burlo/cocca/resources/1000GP_phase3_3202/vcf/TGP_2504/TGP2504_normIndel_multiSplitted.vcf.gz.tab" #full path of the file containing AF information for fomparison with the study population
The examples provided here refer to resources available on the ORFEO cluster. The main path for resources on the Apollo cluster is:
/shared/resources
PCA analyses are performed using the KING software. In order to perform the analyses, the user needs to define the correct resource path using the parameter 1000G_ref_for_king in the config file.
The default value below is set to work on the ORFEO cluster:
1000G_ref_for_king: "/storage/burlo/cocca/resources/1000GP_phase3/king/KGref.bed"
The corresponding file on the Apollo cluster is located at:
/shared/resources/1000G/KGref.bed
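As a hedged illustration of this step (the pipeline runs it internally, and the exact command and file names here are assumptions), a KING PCA projection of the study data onto the 1000G reference typically looks like:
# Hypothetical KING PCA projection onto the 1000G reference;
# "study_samples" is a placeholder PLINK prefix, not a pipeline output name.
king -b /storage/burlo/cocca/resources/1000GP_phase3/king/KGref.bed,study_samples.bed \
  --pca --projection --prefix study_vs_KGref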
In order to perform the Non Reference Discordance calculation, the user needs to provide a VCF file, previously formatted (correct genome reference build, correct REF/ALT alignment). The absolute path to this file has to be provided with the parameter snp_array_data in the PATH block of the config file.
This comparison is useful to detect sample contamination and genotyping errors against a more trusted dataset, so it is advisable to perform it even if other genetic data is available only for a subset of samples. If no other data is available for any sample in the current callset, the parameter in the config file should be left at the default value "NONE", in order to disable this comparison.
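If the SNP array VCF still needs to be prepared, one possible way to fix REF/ALT alignment against the reference with bcftools is sketched below; file names are placeholders and this preparation is not performed by the pipeline itself.
# Hypothetical preparation of the SNP array VCF: left-align indels and
# fix/swap alleles that do not match the reference (bcftools norm -c ws)
ref_fasta=/<USER_DEFINED_PATH>/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna
bcftools norm -f ${ref_fasta} -c ws array_data_GRCh38.vcf.gz -Oz -o array_data_GRCh38_fixed.vcf.gz
bcftools index -t array_data_GRCh38_fixed.vcf.gz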
A manifest file has to be provided, following the template in the resource folder; it is space separated, with the header line:
SAMPLE_ID COHORT sex SEQ
The absolute path of this file has to be specified with the parameter "manifest_table" in the PATH section of the config file.
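A minimal sketch of such a manifest is shown below; the sample IDs, cohort labels and SEQ values are hypothetical, and the real file must follow the template provided in the resource folder.
# Hypothetical manifest content; sex is coded as 1=Male, 2=Female
cat > manifest.txt <<'EOF'
SAMPLE_ID COHORT sex SEQ
SAMPLE001 INGI-FVG 1 WGS
SAMPLE002 INGI-FVG 2 WGS
EOF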
A config file template with default values is provided, named config.yaml. The final user has to provide some information to the pipeline, in order to retrieve the input files and to define the output location.
It is advisable to provide a meaningful project name, since this name will be used as a prefix for the production of pipeline outputs.
########### INPUT DEFINITION ####################
proj_name : "WGS_releaseQC" #name of the project to be used also as suffix for file naming in the final concat rule
The user can specify the chromosomes to be processed. The X chromosome is coded as "chrX". During the pipeline run, some rules will work only on AUTOSOMAL chromosomes.
#################################################################################
chrs: ["chr1","chr2","chr3","chr4","chr5","chr6","chr7","chr8","chr9","chr10","chr11","chr12","chr13","chr14","chr15","chr16","chr17","chr18","chr19","chr20","chr21","chr22","chrX"]
############################################
This section of the config file contains parameters that should be modified only if the relevant rules or tools change because of a new software version. In this section the user will find the parameters "1000G_subpop" and "TGP2504", to be set according to the available resources, as described above.
Moreover, in this section, the user will find two parameters:
- hwe_thr
- missing_thr
which are set to default values and are used to perform the first hard filter on the dataset (see the sketch after this section). These parameters can be changed if more stringent values are required.
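As a hedged illustration of what this hard filter does (the pipeline implements it internally and its exact command may differ), sites failing example thresholds could be removed with bcftools like this:
# Hypothetical per-chromosome hard filter; 1e-8 (HWE p-value) and 0.05
# (missing rate) are example values standing in for hwe_thr and missing_thr.
bcftools +fill-tags chr1.vcf.gz -Ou -- -t HWE,F_MISSING | \
  bcftools view -e 'INFO/HWE < 1e-8 || INFO/F_MISSING > 0.05' -Oz -o chr1_hweMissClean.vcf.gz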
In addition, the following parameter
- exc_het_pval_thr
can be set to a more stringent value; this will flag more variants for exclusion due to the excess-of-heterozygosity p-value, which is calculated by bcftools at this stage.
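Similarly, a hedged sketch of extracting the variants flagged by the bcftools excess-of-heterozygosity p-value (example threshold and file names; not necessarily the pipeline's exact command):
# Hypothetical extraction of variants flagged for excess heterozygosity;
# 1e-6 is an example value standing in for exc_het_pval_thr.
bcftools +fill-tags chr1.vcf.gz -Ou -- -t ExcHet | \
  bcftools query -i 'INFO/ExcHet < 1e-6' -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/ExcHet\n' > chr1_exc_het_flagged.tsv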
The bottom section of the config file allows the user to specify input and output file paths and the path of each tool used.
Regarding output paths, if a relative path is provided, all folders will be generated in a path relative to the current pipeline execution folder.
################ PATHS #################
### - INPUT and OUTPUT FILES PATHS - ###
paths:
  input_vcf: "input_file_path" #absolute path for the input files to be QC-ed. It should point to a single file with all samples and chromosomes
  base_out: "base_output_folder" #base folder for the output
  log_dir: "Log" #base path for log folder
  benchmark: "benchmark" #base path for benchmark folder
  tmp: "localtemp" #base path for tmp folder
  manifest_table: "" #sex table formatted with spaces as separator, and "SAMPLE_ID COHORT sex SEQ" as header. Sex should be coded as 1=Male, 2=Female
  1000G_ref_for_king: "/storage/burlo/cocca/resources/1000GP_phase3/king/KGref.bed" #reference files for king pca projection
  snp_array_data: "NONE" #path and prefix of the snp array data to be used for NRD calculations. Leave set to "NONE" if no other data is available for any sample in the callset.
The PATH TOOL section contains the paths of the binaries used in the pipeline. These paths should be changed only if the software used is not available in the user's $PATH, for example following a custom installation.
### - PATH TOOL - ###
BCFTOOLS: "bcftools"
VCFTOOLS: "vcftools"
BEDTOOLS: "bedtools"
PLINK: "plink"
KING: "king"
The pipeline produces a set of output files that should be used to generate a clean data release. The main folder structure will be:
.
├── 01.VQSR_reapply
├── 02.sites
├── 03.samples
├── 04.nrdr
├── 05.stats
└── 06.release
Each folder contains output data related to a set of rules.
The 01.VQSR_reapply folder contains files generated after the first step of VQSR threshold re-application, split by chromosome. These files are then used in all subsequent steps of the pipeline.
The 02.sites folder contains files generated by all rules regarding variant filtering and QC:
.
├── arrayAF
├── HetRate
├── HweMissingClean
├── MissingRate
└── popAF
In the folder "HweMissingClean" will be saved the files automatically filtered by HWE and missing rate. Those files will be used for subsequent steps. The folder arrayAF will contain tables and plots regarding a comparison between the WGS data and the provided SNP array data. The files named as "WGS_ITA_PREREL_chr19_MERGED_ARRAY_af_extrDiff.txt" should be used for variants removal
The 03.samples folder contains files generated by all rules related to sample filtering and QC.
The 04.nrdr folder contains files generated by the NRD rules, for both samples and variants.
The 05.stats folder contains stats files generated by bcftools, which can be used to check sample and site parameters before and after the first filtering step based on HWE and variant call rate.
The 06.release folder contains tables that should be used to generate the lists of variants and samples to exclude from the call set.
There are different ways to run the pipeline: Local mode, Cluster mode, or Single node mode.
In Local mode, the pipeline is executed in an interactive shell session (locally or on a cluster) and all rules are treated as processes that can run sequentially or in parallel, depending on the resources provided. One example of a Local execution is:
conda activate snakemake
base_cwd=/<USER_DEFINED_PATH>/WGS_QC
log_folder=${base_cwd}/Log
mkdir -p ${log_folder}
cd ${base_cwd}
snakefile=/<USER_DEFINED_PATH>/WGSdownstreamQC-snakemake/Snakefile
configfile=/<USER_DEFINED_PATH>/WGS_QC/WGSdownstreamQC_pipeline.yaml
cores=24
snakemake -p -r -s ${snakefile} --configfile ${configfile} --use-envmodules --keep-going --cores ${cores}
In this example we assumed 24 CPUs were available for the computation.
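Regardless of the execution mode, a Snakemake dry run can be used beforehand to validate the configuration and preview the jobs that would be executed, for instance:
# Dry run: print the planned jobs and commands without executing anything
snakemake -n -p -r -s ${snakefile} --configfile ${configfile}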
In Cluster mode, the pipeline runs in an interactive shell (screen or tmux) and each rule is submitted as a job on the cluster. One example of a Cluster execution, on the ORFEO cluster, is:
module load conda
conda activate snakemake
base_cwd=/<USER_DEFINED_PATH>/WGS_QC
log_folder=${base_cwd}/Log
mkdir -p ${log_folder}
cd ${base_cwd}
snakefile=/<USER_DEFINED_PATH>/WGSdownstreamQC-snakemake/Snakefile
configfile=/<USER_DEFINED_PATH>/WGS_QC/WGSdownstreamQC_pipeline.yaml
log_name=WGS-QC_pipeline.log
stderr_name=WGS-QC_pipeline.err
cores=24
queue=thin
snakemake -p -r -s ${snakefile} --configfile ${configfile} --use-envmodules --keep-going --cluster "qsub -q ${queue} -V -k eod -l select=1:ncpus={threads}:mem={resources.mem_mb}mb -l walltime=96:00:00" -j ${cores} 1> ${log_name} 2> ${stderr_name}
In this example we also defined the names of two additional log files, which help to keep track of the pipeline execution. In this case, the -j option defines how many concurrent jobs are submitted to the cluster.
In Single node mode, the pipeline runs as a job on the cluster and all rules are treated as processes that can run sequentially or in parallel, depending on the resources provided, similarly to the Local execution mode. One example of a Single node execution, on the ORFEO cluster, is:
module load conda
conda activate snakemake
base_cwd=/<USER_DEFINED_PATH>/WGS_QC
log_folder=${base_cwd}/Log
mkdir -p ${log_folder}
cd ${base_cwd}
snakefile=/<USER_DEFINED_PATH>/WGSdownstreamQC-snakemake/Snakefile
configfile=/<USER_DEFINED_PATH>/WGS_QC/WGSdownstreamQC_pipeline.yaml
log_name=WGS-QC_pipeline.log
stderr_name=WGS-QC_pipeline.err
cores=24
queue=thin
mem=750g
echo "cd ${base_cwd};module load conda;conda activate snakemake; snakemake -p -r -s ${snakefile} --configfile ${configfile} --cores ${cores} --use-envmodules --keep-going" | qsub -N snake_preprocessing -q ${queue} -V -k eod -o ${log_folder}/${log_name} -e ${log_folder}/${stderr_name} -l select=1:ncpus=${cores}:mem=${mem} -l walltime=96:00:00
In this example we selected an entire cluster node on the "thin" queue of the ORFEO cluster, defining the number of CPUs (24) and the total amount of RAM required to run the pipeline (750g). We also defined the names of the two additional log files, to keep track of the pipeline execution.