/HiFi-human-WGS-WDL

Primary LanguageWDLBSD 3-Clause Clear LicenseBSD-3-Clause-Clear

DISCLAIMER

TO THE GREATEST EXTENT PERMITTED BY APPLICABLE LAW, THIS WEBSITE AND ITS CONTENT, INCLUDING ALL SOFTWARE, SOFTWARE CODE, SITE-RELATED SERVICES, AND DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. ALL WARRANTIES ARE REJECTED AND DISCLAIMED. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THE FOREGOING. PACBIO IS NOT OBLIGATED TO PROVIDE ANY SUPPORT FOR ANY OF THE FOREGOING, AND ANY SUPPORT PACBIO DOES PROVIDE IS SIMILARLY PROVIDED WITHOUT REPRESENTATION OR WARRANTY OF ANY KIND. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A REPRESENTATION OR WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACBIO.

PacBio WGS Variant Pipeline

Workflow for analyzing human PacBio whole genome sequencing (WGS) data using Workflow Description Language (WDL).

Workflow

Workflow entrypoint: workflows/main.wdl

PacBio WGS Variant Pipeline performs read alignment, variant calling, and phasing. Joint-calling of small variants and structural variants for cohorts and optional variant filtering and annotation is also available for HiFi human WGS. The workflow can run using Azure, AWS, GCP, and HPC backends.

PacBio WGS Variant Pipeline diagram

Setup

We recommend cloning the repo rather than downloading the release package. Some tasks and workflows are pulled in from other repositories. Ensure you have initialized submodules following cloning by running git submodule update --init --recursive.

Resource requirements

The workflow requires at minimum 64 cores and 256 GB of RAM. Ensure that the backend environment you're using has enough quota to run the workflow.

Reference datasets and associated workflow files

Reference datasets are hosted publicly for use in the pipeline. For data locations, see the backend-specific documentation and template inputs files for each backend with paths to publicly hosted reference files filled out.

Running the workflow

  1. Select a backend environment
  2. Configure a workflow execution engine in the chosen environment
  3. Fill out the inputs JSON file for your cohort
  4. Run the workflow

Selecting a backend

The workflow can be run on Azure, AWS, GCP, or HPC. Your choice of backend will largely be determined by the location of your data.

For backend-specific configuration, see the relevant documentation:

Configuring a workflow engine

An execution engine is required to run workflows. Two popular engines for running WDL-based workflows are miniwdl and Cromwell.

See the backend-specific documentation for details on setting up an engine.

Engine Azure AWS GCP HPC
miniwdl Unsupported Supported via the Amazon Genomics CLI Unsupported (SLURM only) Supported via the miniwdl-slurm plugin
Cromwell Supported via Cromwell on Azure Supported via the Amazon Genomics CLI Supported via Google's Pipelines API Supported - Configuration varies depending on HPC infrastructure

Filling out the inputs JSON

The input to a workflow run is defined in JSON format. Template input files with reference dataset information filled out are available for each backend:

Using the appropriate inputs template file, fill in the cohort and sample information (see Workflow Inputs for more information on the input structure).

If using an HPC backend, you will need to download the reference bundle and replace the <local_path_prefix> in the input template file with the local path to the reference datasets on your HPC.

Running the workflow

Run the workflow using the engine and backend that you have configured (miniwdl, Cromwell).

Note that the calls to miniwdl and Cromwell assume you are accessing the engine directly on the machine on which it has been deployed. Depending on the backend you have configured, you may be able to submit workflows using different methods (e.g. using trigger files in Azure, or using the Amazon Genomics CLI in AWS).

Run directly using miniwdl

miniwdl run workflows/main.wdl -i <input_file_path.json>

Run directly using Cromwell

java -jar <cromwell_jar_path> run workflows/main.wdl -i <input_file_path.json>

If Cromwell is running in server mode, the workflow can be submitted using cURL. Fill in the values of CROMWELL_URL and INPUTS_JSON below, then from the root of the repository, run:

# The base URL (and port, if applicable) of your Cromwell server
CROMWELL_URL=
# The path to your inputs JSON file
INPUTS_JSON=

(cd workflows && zip -r dependencies.zip humanwgs_structs.wdl  cohort_analysis/ sample_analysis/ tertiary_analysis/ wdl-common/)
curl -X "POST" \
  "${CROMWELL_URL}/api/workflows/v1" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "workflowSource=@workflows/main.wdl" \
  -F "workflowInputs=@${INPUTS_JSON};type=application/json" \
  -F "workflowDependencies=@workflows/dependencies.zip;type=application/zip"

To specify workflow options, add the following to the request (assuming your options file is a file called options.json located in the pwd): -F "workflowOptions=@options.json;type=application/json".

Workflow inputs

This section describes the inputs required for a run of the workflow. Typically, only the humanwgs.cohort and potentially run/backend-specific sections will be filled out by the user for each run of the workflow. Input templates with reference file locations filled out are provided for each backend.

A cohort can include one or more samples. Samples need not be related, but if you plan to run tertiary analysis, it is best to think of a cohort as a family of related samples. We have tested cohorts of up to 5 samples with 30x coverage. Larger cohorts may encounter memory issues during joint calling.

Type Name Description Notes
String cohort_id A unique name for the cohort; used to name outputs
Array[Sample] samples The set of samples for the cohort. At least one sample must be defined.
Array[String] phenotypes Human Phenotype Ontology (HPO) phenotypes associated with the cohort. If no particular phenotypes are desired, the root HPO term, "HP:0000001", can be used.

Sample information for each sample in the workflow run.

Type Name Description Notes
String sample_id A unique name for the sample; used to name outputs
Array[IndexData] movie_bams The set of unaligned movie BAMs associated with this sample
String? sex Sample sex ["MALE", "FEMALE", null]. If the sex field is missing or null, sex will be set to unknown. Used to set the expected sex chromosome karyotype for TRGT and HiFiCNV.
Boolean affected Is this sample affected by the phenotype? [true, false]
String? father_id Paternal sample_id
String? mother_id Maternal sample_id

Files associated with the reference genome.

These files are hosted publicly in each of the cloud backends; see backends/${backend}/inputs.${backend}.json.

Type Name Description Notes
String name Reference name; used to name outputs (e.g., "GRCh38") Note: The workflow currently only supports GRCh38 and provides GCA_000001405.15_GRCh38_no_alt_analysis_set.
IndexData fasta Reference genome and index
File tandem_repeat_bed Tandem repeat locations used by pbsv to normalize SV representation
File trgt_tandem_repeat_bed Tandem repeat sites to be genotyped by TRGT
IndexData hificnv_exclude_bed Compressed BED and index of regions to exclude from calling by HiFiCNV. We recommend cnv.excluded_regions.common_50.hg38.bed.gz.
File hificnv_expected_bed_male BED of expected copy number for male karyotype for HiFiCNV
File hificnv_expected_bed_female BED of expected copy number for female karyotype for HiFiCNV
File? gnomad_af gnomAD v3.1 allele frequences in slivar gnotate format required if run_tertiary_analysis is set to true
File? hprc_af Allele frequences in ~100 Human Pangenome Reference Consortium (HPRC) samples in slivar gnotate format required if run_tertiary_analysis is set to true
File? gff Ensembl GFF3 reference annotation required if run_tertiary_analysis is set to true
Array[IndexData?] population_vcfs An array of structural variant population VCFs required if run_tertiary_analysis is set to true

Files associated with slivar annotation. These are required if run_tertiary_analysis is set to true.

These files are hosted publicly in each of the cloud backends; see backends/${backend}/inputs.${backend}.json.

Type Name Description Notes
File slivar_js Additional javascript functions for slivar
File hpo_terms HPO annotation lookups
File hpo_dag HPO annotation lookups
File hpo_annotations HPO annotation lookups
File ensembl_to_hgnc Ensembl to HGNC gene mapping
File lof_lookup Loss-of-function scores per gene
File clinvar_lookup ClinVar annotations per gene

Other inputs

Type Name Description Notes
String? deepvariant_version Version of deepvariant to use ["1.5.0"]
DeepVariantModel? deepvariant_model Optional alternate DeepVariant model file to use
Int? pbsv_call_mem_gb Optionally set RAM (GB) for pbsv_call during cohort analysis
Int? glnexus_mem_gb Optionally set RAM (GB) for GLnexus during cohort analysis
Boolean? run_tertiary_analysis Run the optional tertiary analysis steps [false] [true, false]
String backend Backend where the workflow will be executed ["Azure", "AWS", "GCP", "HPC"]
String? zones Zones where compute will take place; required if backend is set to 'AWS' or 'GCP'.
String? aws_spot_queue_arn Queue ARN for the spot batch queue; required if backend is set to 'AWS' and preemptible is set to true Determining the AWS queue ARN
String? aws_on_demand_queue_arn Queue ARN for the on demand batch queue; required if backend is set to 'AWS' and preemptible is set to false Determining the AWS queue ARN
String? container_registry Container registry where workflow images are hosted. If left blank, PacBio's public Quay.io registry will be used.
Boolean preemptible If set to true, run tasks preemptibly where possible. On-demand VMs will be used only for tasks that run for >24 hours if the backend is set to GCP. If set to false, on-demand VMs will be used for every task. Ignored if backend is set to HPC. [true, false]

Workflow outputs

Sample analysis

These files will be output for each sample defined in the cohort.

Type Name Description Notes
Array[Array[File]] bam_stats TSV of length and quality for each read (per input BAM)
Array[Array[File]] read_length_summary For each input BAM, read length distribution (per input BAM)
Array[Array[File]] read_quality_summary For each input BAM, read quality distribution (per input BAM)
Array[IndexData] small_variant_gvcfs Small variants (SNPs and INDELs < 50bp) gVCFs called by DeepVariant (with index)
Array[File] small_variant_vcf_stats bcftools stats summary statistics for small variants
Array[File] small_variant_roh_out Output of bcftools roh using --AF-dflt 0.4
Array[File] small_variant_roh_bed Regions of homozygosity determiend by bcftools roh using --AF-dflt 0.4
Array[IndexData] sample_phased_small_variant_vcfs Small variants called by DeepVariant and phased by HiPhase (with index)
Array[IndexData] sample_phased_sv_vcfs Structural variants called by pbsv and phased by HiPhase (with index)
Array[File] sample_hiphase_stats Phase block summary statistics written by HiPhase
Array[File] sample_hiphase_blocks Phase block list written by HiPhase
Array[File] sample_hiphase_haplotags Per-read haplotag information, written by HiPhase
Array[IndexData] merged_haplotagged_bam Aligned (by pbmm2), haplotagged (by HiPhase) reads (with index)
Array[File] haplotagged_bam_mosdepth_summary mosdepth summary of median depths per chromosome
Array[File] haplotagged_bam_mosdepth_region_bed mosdepthhttps://github.com/brentp/mosdepth BED of median coverage depth per 500 bp window
Array[IndexData] trgt_repeat_vcf Tandem repeat genotypes from TRGT (with index)
Array[IndexData] trgt_spanning_reads Fragments of HiFi reads spanning loci genotyped by TRGT (with index)
Array[File] trgt_dropouts Regions with insufficient coverage to genotype by TRGT
Array[Array[File]] cpg_pileup_beds 5mCpG site methylation probability pileups generated by pb-CpG-tools
Array[Array[File]] cpg_pileup_bigwigs 5mCpG site methylation probability pileups generated by pb-CpG-tools
Array[File] paraphase_output Output generated by Paraphase
Array[IndexData] paraphase_realigned_bam Realigned BAM for selected medically relevant genes in segmental duplications (with index), generated by Paraphase
Array[Array[File]] paraphase_vcfs Phased Variant calls for selected medically relevant genes in segmental duplications, generated by Paraphase
Array[IndexData] hificnv_vcfs VCF output containing copy number variant calls for the sample from HiFiCNV
Array[File] hificnv_copynum_bedgraphs Copy number values calculated for each region
Array[File] hificnv_depth_bws Bigwig file containing the depth measurements from HiFiCNV
Array[File] hificnv_maf_bws Bigwig file containing the minor allele frequency measurements from DeepVariant, generated by HiFiCNV

Cohort analysis

These files will be output if the cohort includes more than one sample.

Type Name Description Notes
IndexData? cohort_small_variant_vcf Small variants called by DeepVariant, joint-called by GLnexus, and phased by HiPhase (with index)
IndexData? cohort_sv_vcf Structural variants joint-called by pbsv and phased by HiPhase (with index)
File? cohort_hiphase_stats Phase block summary statistics written by HiPhase
File? cohort_hiphase_blocks Phase block list written by HiPhase

Tertiary analysis

These files will be output for each run of the workflow if run_tertiary_analysis is set to true. The files that are being annotated will depend on whether the number of samples is equal to or greater than one:

  • If the number of samples is equal to one, the files being annotated in this step are the sample small variant VCF and SV VCF.
  • If the number of samples is greater than one, the files being annotated in this step are the phased, joint-called small variant VCF and the cohort SV VCF.
Type Name Description Notes
IndexData? filtered_small_variant_vcf Small variant calls that are filtered based on population frequency and annotated with cohort information, population frequency, gene, functional impact, etc., using slivar and bcftools csq
IndexData? compound_het_small_variant_vcf Compound heterozygotes annotated with cohort information, population frequency, gene, functional impact, etc., using slivar and bcftools csq
File? filtered_small_variant_tsv Filtered VCFs are reformatted as a human-readable TSV by slivar tsv
File? compound_het_small_variant_tsv Filtered VCFs are reformatted as a human-readable TSV by slivar tsv
IndexData? filtered_svpack_vcf Structural variant calls that are filtered based on population frequency and annotated with cohort information, population frequency, gene, functional impact, etc., using svpack
File? filtered_svpack_tsv Filtered VCFs are reformatted as a human-readable TSV by slivar tsv

Tool versions and Docker images

Docker images definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio's quay.io. Docker images used in the workflow are pegged to specific versions by referring to their digests rather than tags.

The Docker image used by a particular step of the workflow can be identified by looking at the docker key in the runtime block for the given task. Images can be referenced in the following table by looking for the name after the final / character and before the @sha256:.... For example, the image referred to here is "align_hifiasm":

~{runtime_attributes.container_registry}/align_hifiasm@sha256:3968cb<...>b01f80fe

Image Major tool versions Links
bcftools Dockerfile
deepvariant User-defined; default is version 1.5.0 DeepVariant GitHub
glnexus GLnexus GitHub
hificnv Dockerfile
hiphase Dockerfile
htslib Dockerfile
mosdepth Dockerfile
paraphase Dockerfile
pb-cpg-tools Dockerfile
pbmm2 Dockerfile
pbsv Dockerfile
pyyaml Dockerfile
samtools Dockerfile
slivar Dockerfile
svpack Dockerfile
trgt Dockerfile