PacBio WGS Variant Pipeline

Workflow for analyzing human PacBio whole genome sequencing (WGS) data using Workflow Description Language (WDL).

Docker images used by this workflow are defined in the wdl-dockerfiles repo. Images are hosted in PacBio's quay.io.
Common tasks that may be reused within or between workflows are defined in the wdl-common repo.

Workflow

Workflow entrypoint: workflows/main.wdl

PacBio WGS Variant Pipeline performs read alignment, variant calling, and phasing. Joint-calling of small variants and structural variants for cohorts and optional variant filtering and annotation is also available for HiFi human WGS. The workflow can run using Azure, AWS, GCP, and HPC backends.

Setup

We recommend cloning the repo rather than downloading the release package. Some tasks and workflows are pulled in from other repositories. Ensure you have initialized submodules following cloning by running git submodule update --init --recursive.

Resource requirements

The workflow requires at minimum 64 cores and 256 GB of RAM. Ensure that the backend environment you're using has enough quota to run the workflow.

Reference datasets and associated workflow files

Reference datasets are hosted publicly for use in the pipeline. For data locations, see the backend-specific documentation and template inputs files for each backend with paths to publicly hosted reference files filled out.

Running the workflow

Select a backend environment
Configure a workflow execution engine in the chosen environment
Fill out the inputs JSON file for your cohort
Run the workflow

Selecting a backend

The workflow can be run on Azure, AWS, GCP, or HPC. Your choice of backend will largely be determined by the location of your data.

For backend-specific configuration, see the relevant documentation:

Configuring a workflow engine and container runtime

An execution engine is required to run workflows. Two popular engines for running WDL-based workflows are miniwdl and Cromwell.

Because workflow dependencies are containerized, a container runtime is required. This workflow has been tested with Docker and Singularity container runtimes.

See the backend-specific documentation for details on setting up an engine.

Engine	Azure	AWS	GCP	HPC
miniwdl	Unsupported	Supported via the Amazon Genomics CLI	Unsupported	(SLURM only) Supported via the `miniwdl-slurm` plugin
Cromwell	Supported via Cromwell on Azure	Supported via the Amazon Genomics CLI	Supported via Google's Pipelines API	Supported - Configuration varies depending on HPC infrastructure

Filling out the inputs JSON

The input to a workflow run is defined in JSON format. Template input files with reference dataset information filled out are available for each backend:

Using the appropriate inputs template file, fill in the cohort and sample information (see Workflow Inputs for more information on the input structure).

If using an HPC backend, you will need to download the reference bundle and replace the <local_path_prefix> in the input template file with the local path to the reference datasets on your HPC.

Running the workflow

Run the workflow using the engine and backend that you have configured (miniwdl, Cromwell).

Note that the calls to miniwdl and Cromwell assume you are accessing the engine directly on the machine on which it has been deployed. Depending on the backend you have configured, you may be able to submit workflows using different methods (e.g. using trigger files in Azure, or using the Amazon Genomics CLI in AWS).

Run directly using miniwdl

miniwdl run workflows/main.wdl -i <input_file_path.json>

Run directly using Cromwell

java -jar <cromwell_jar_path> run workflows/main.wdl -i <input_file_path.json>

If Cromwell is running in server mode, the workflow can be submitted using cURL. Fill in the values of CROMWELL_URL and INPUTS_JSON below, then from the root of the repository, run:

# The base URL (and port, if applicable) of your Cromwell server
CROMWELL_URL=
# The path to your inputs JSON file
INPUTS_JSON=

(cd workflows && zip -r dependencies.zip humanwgs_structs.wdl  cohort_analysis/ sample_analysis/ tertiary_analysis/ wdl-common/)
curl -X "POST" \
  "${CROMWELL_URL}/api/workflows/v1" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "workflowSource=@workflows/main.wdl" \
  -F "workflowInputs=@${INPUTS_JSON};type=application/json" \
  -F "workflowDependencies=@workflows/dependencies.zip;type=application/zip"

To specify workflow options, add the following to the request (assuming your options file is a file called options.json located in the pwd): -F "workflowOptions=@options.json;type=application/json".

Workflow inputs

This section describes the inputs required for a run of the workflow. Typically, only the humanwgs.cohort and potentially run/backend-specific sections will be filled out by the user for each run of the workflow. Input templates with reference file locations filled out are provided for each backend.

Cohort

A cohort can include one or more samples. Samples need not be related, but if you plan to run tertiary analysis, it is best to think of a cohort as a family of related samples. We have tested cohorts of up to 5 samples with 30x coverage. Larger cohorts may encounter memory issues during joint calling.

Type	Name	Description
String	cohort_id	A unique name for the cohort; used to name outputs
Array[Sample]	samples	The set of samples for the cohort. At least one sample must be defined.
Array[String]	phenotypes	Human Phenotype Ontology (HPO) phenotypes associated with the cohort. If no particular phenotypes are desired, the root HPO term, `"HP:0000001"`, can be used.

Sample

Sample information for each sample in the workflow run.

Type	Name	Description	Notes
String	sample_id	A unique name for the sample; used to name outputs
Array[IndexData]	movie_bams	The set of unaligned movie BAMs associated with this sample
String?	sex	Sample sex	["MALE", "FEMALE", `null`]. If the sex field is missing or `null`, sex will be set to unknown. Used to set the expected sex chromosome karyotype for TRGT and HiFiCNV.
Boolean	affected	Is this sample affected by the phenotype?	[`true`, `false`]
String?	father_id	Paternal `sample_id`
String?	mother_id	Maternal `sample_id`

ReferenceData

Files associated with the reference genome.

These files are hosted publicly in each of the cloud backends; see backends/${backend}/inputs.${backend}.json.

Type	Name	Description	Notes
String	name	Reference name; used to name outputs (e.g., "GRCh38")	Note: The workflow currently only supports GRCh38 and provides GCA_000001405.15_GRCh38_no_alt_analysis_set.
IndexData	fasta	Reference genome and index
File	tandem_repeat_bed	Tandem repeat locations used by pbsv to normalize SV representation
File	trgt_tandem_repeat_bed	Tandem repeat sites to be genotyped by TRGT
IndexData	hificnv_exclude_bed	Compressed BED and index of regions to exclude from calling by HiFiCNV. We recommend cnv.excluded_regions.common_50.hg38.bed.gz.
File	hificnv_expected_bed_male	BED of expected copy number for male karyotype for HiFiCNV
File	hificnv_expected_bed_female	BED of expected copy number for female karyotype for HiFiCNV
File?	gnomad_af	gnomAD v3.1 allele frequences in `slivar gnotate` format	required if `run_tertiary_analysis` is set to `true`
File?	hprc_af	Allele frequences in ~100 Human Pangenome Reference Consortium (HPRC) samples in `slivar gnotate` format	required if `run_tertiary_analysis` is set to `true`
File?	gff	Ensembl GFF3 reference annotation	required if `run_tertiary_analysis` is set to `true`
Array[IndexData?]	population_vcfs	An array of structural variant population VCFs	required if `run_tertiary_analysis` is set to `true`

SlivarData

Files associated with slivar annotation. These are required if run_tertiary_analysis is set to true.

These files are hosted publicly in each of the cloud backends; see backends/${backend}/inputs.${backend}.json.

Type	Name	Description
File	slivar_js	Additional javascript functions for slivar
File	hpo_terms	HPO annotation lookups
File	hpo_dag	HPO annotation lookups
File	hpo_annotations	HPO annotation lookups
File	ensembl_to_hgnc	Ensembl to HGNC gene mapping
File	lof_lookup	Loss-of-function scores per gene
File	clinvar_lookup	ClinVar annotations per gene

Other inputs

Type	Name	Description	Notes
String?	deepvariant_version	Version of deepvariant to use ["1.5.0"]
DeepVariantModel?	deepvariant_model	Optional alternate DeepVariant model file to use
Int?	pbsv_call_mem_gb	Optionally set RAM (GB) for pbsv_call during cohort analysis
Int?	glnexus_mem_gb	Optionally set RAM (GB) for GLnexus during cohort analysis
Boolean?	run_tertiary_analysis	Run the optional tertiary analysis steps [`false`]	[`true`, `false`]
String	backend	Backend where the workflow will be executed	["Azure", "AWS", "GCP", "HPC"]
String?	zones	Zones where compute will take place; required if backend is set to 'AWS' or 'GCP'.	Determining available zones in AWS Determining available zones in GCP
String?	aws_spot_queue_arn	Queue ARN for the spot batch queue; required if backend is set to 'AWS' and `preemptible` is set to `true`	Determining the AWS queue ARN
String?	aws_on_demand_queue_arn	Queue ARN for the on demand batch queue; required if backend is set to 'AWS' and `preemptible` is set to `false`	Determining the AWS queue ARN
String?	container_registry	Container registry where workflow images are hosted. If left blank, PacBio's public Quay.io registry will be used.
Boolean	preemptible	If set to `true`, run tasks preemptibly where possible. On-demand VMs will be used only for tasks that run for >24 hours if the backend is set to GCP. If set to `false`, on-demand VMs will be used for every task. Ignored if backend is set to HPC.	[`true`, `false`]

Workflow outputs

Sample analysis

These files will be output for each sample defined in the cohort.

Type	Name	Description
Array[Array[File]]	bam_stats	TSV of length and quality for each read (per input BAM)
Array[Array[File]]	read_length_summary	For each input BAM, read length distribution (per input BAM)
Array[Array[File]]	read_quality_summary	For each input BAM, read quality distribution (per input BAM)
Array[IndexData]	small_variant_gvcfs	Small variants (SNPs and INDELs < 50bp) gVCFs called by DeepVariant (with index)
Array[File]	small_variant_vcf_stats	`bcftools stats` summary statistics for small variants
Array[File]	small_variant_roh_out	Output of `bcftools roh` using `--AF-dflt 0.4`
Array[File]	small_variant_roh_bed	Regions of homozygosity determiend by `bcftools roh` using `--AF-dflt 0.4`
Array[IndexData]	sample_phased_small_variant_vcfs	Small variants called by DeepVariant and phased by HiPhase (with index)
Array[IndexData]	sample_phased_sv_vcfs	Structural variants called by pbsv and phased by HiPhase (with index)
Array[File]	sample_hiphase_stats	Phase block summary statistics written by HiPhase
Array[File]	sample_hiphase_blocks	Phase block list written by HiPhase
Array[File]	sample_hiphase_haplotags	Per-read haplotag information, written by HiPhase
Array[IndexData]	merged_haplotagged_bam	Aligned (by pbmm2), haplotagged (by HiPhase) reads (with index)
Array[File]	haplotagged_bam_mosdepth_summary	mosdepth summary of median depths per chromosome
Array[File]	haplotagged_bam_mosdepth_region_bed	mosdepthhttps://github.com/brentp/mosdepth BED of median coverage depth per 500 bp window
Array[IndexData]	trgt_repeat_vcf	Tandem repeat genotypes from TRGT (with index)
Array[IndexData]	trgt_spanning_reads	Fragments of HiFi reads spanning loci genotyped by TRGT (with index)
Array[File]	trgt_dropouts	Regions with insufficient coverage to genotype by TRGT
Array[Array[File]]	cpg_pileup_beds	5mCpG site methylation probability pileups generated by pb-CpG-tools
Array[Array[File]]	cpg_pileup_bigwigs	5mCpG site methylation probability pileups generated by pb-CpG-tools
Array[File]	paraphase_output	Output generated by Paraphase
Array[IndexData]	paraphase_realigned_bam	Realigned BAM for selected medically relevant genes in segmental duplications (with index), generated by Paraphase
Array[Array[File]]	paraphase_vcfs	Phased Variant calls for selected medically relevant genes in segmental duplications, generated by Paraphase
Array[IndexData]	hificnv_vcfs	VCF output containing copy number variant calls for the sample from HiFiCNV
Array[File]	hificnv_copynum_bedgraphs	Copy number values calculated for each region
Array[File]	hificnv_depth_bws	Bigwig file containing the depth measurements from HiFiCNV
Array[File]	hificnv_maf_bws	Bigwig file containing the minor allele frequency measurements from DeepVariant, generated by HiFiCNV

Cohort analysis

These files will be output if the cohort includes more than one sample.

Type	Name	Description
IndexData?	cohort_small_variant_vcf	Small variants called by DeepVariant, joint-called by GLnexus, and phased by HiPhase (with index)
IndexData?	cohort_sv_vcf	Structural variants joint-called by pbsv and phased by HiPhase (with index)
File?	cohort_hiphase_stats	Phase block summary statistics written by HiPhase
File?	cohort_hiphase_blocks	Phase block list written by HiPhase

Tertiary analysis

These files will be output for each run of the workflow if run_tertiary_analysis is set to true. The files that are being annotated will depend on whether the number of samples is equal to or greater than one:

If the number of samples is equal to one, the files being annotated in this step are the sample small variant VCF and SV VCF.
If the number of samples is greater than one, the files being annotated in this step are the phased, joint-called small variant VCF and the cohort SV VCF.

Type	Name	Description
IndexData?	filtered_small_variant_vcf	Small variant calls that are filtered based on population frequency and annotated with cohort information, population frequency, gene, functional impact, etc., using slivar and `bcftools csq`
IndexData?	compound_het_small_variant_vcf	Compound heterozygotes annotated with cohort information, population frequency, gene, functional impact, etc., using slivar and `bcftools csq`
File?	filtered_small_variant_tsv	Filtered VCFs are reformatted as a human-readable TSV by `slivar tsv`
File?	compound_het_small_variant_tsv	Filtered VCFs are reformatted as a human-readable TSV by `slivar tsv`
IndexData?	filtered_svpack_vcf	Structural variant calls that are filtered based on population frequency and annotated with cohort information, population frequency, gene, functional impact, etc., using svpack
File?	filtered_svpack_tsv	Filtered VCFs are reformatted as a human-readable TSV by `slivar tsv`

Tool versions and Docker images

Docker images definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio's quay.io. Docker images used in the workflow are pegged to specific versions by referring to their digests rather than tags.

The Docker image used by a particular step of the workflow can be identified by looking at the docker key in the runtime block for the given task. Images can be referenced in the following table by looking for the name after the final / character and before the @sha256:.... For example, the image referred to here is "align_hifiasm":

~{runtime_attributes.container_registry}/align_hifiasm@sha256:3968cb<...>b01f80fe

Image	Major tool versions	Links
bcftools	bcftools 1.14	Dockerfile
deepvariant	User-defined; default is version 1.5.0	DeepVariant GitHub
glnexus	glnexus v1.4.3	GLnexus GitHub
hificnv	HiFiCNV v0.1.7 bcftools 1.16	Dockerfile
hiphase	HiPhase 1.0.0 samtools 1.18 bcftools 1.18	Dockerfile
htslib	htslib 1.14	Dockerfile
mosdepth	mosdepth 0.2.9	Dockerfile
paraphase	minimap2 2.17 samtools 1.14 paraphase 2.2.3	Dockerfile
pb-cpg-tools	pb-CpG-tools v2.3.2	Dockerfile
pbmm2	pbmm2 1.10.0 datamash 1.1.0 pysam 0.16.0.1	Dockerfile
pbsv	pbsv 2.9.0 htslib 1.14	Dockerfile
pyyaml	pyyaml 5.3.1	Dockerfile
samtools	samtools 1.14	Dockerfile
slivar	slivar 0.2.2 bcftools 1.14 vcfpy 0.13.3 pysam 0.19.1	Dockerfile
svpack	svpack 36180ae6 htslib 1.18 pysam 0.21.0	Dockerfile
trgt	trgt 0.5.0 samtools 1.18 bcftools 1.18 pysam 0.21.0	Dockerfile

DISCLAIMER

TO THE GREATEST EXTENT PERMITTED BY APPLICABLE LAW, THIS WEBSITE AND ITS CONTENT, INCLUDING ALL SOFTWARE, SOFTWARE CODE, SITE-RELATED SERVICES, AND DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. ALL WARRANTIES ARE REJECTED AND DISCLAIMED. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THE FOREGOING. PACBIO IS NOT OBLIGATED TO PROVIDE ANY SUPPORT FOR ANY OF THE FOREGOING, AND ANY SUPPORT PACBIO DOES PROVIDE IS SIMILARLY PROVIDED WITHOUT REPRESENTATION OR WARRANTY OF ANY KIND. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A REPRESENTATION OR WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACBIO.

sunqiangzai/HiFi-human-WGS-WDL

PacBio WGS Variant Pipeline

Workflow

Setup

Resource requirements

Reference datasets and associated workflow files

Running the workflow

Selecting a backend

Configuring a workflow engine and container runtime

Filling out the inputs JSON

Running the workflow

Run directly using miniwdl

Run directly using Cromwell

Workflow inputs

Cohort

Sample

ReferenceData

SlivarData

Other inputs

Workflow outputs

Sample analysis

Cohort analysis

Tertiary analysis

Tool versions and Docker images

DISCLAIMER