cpsr: An R repository from ohofmann

Cancer Predisposition Sequencing Reporter (CPSR)

Overview

The Cancer Predisposition Sequencing Reporter (CPSR) is a computational workflow that interprets germline variants identified from next-generation sequencing in the context of cancer predisposition. The workflow is integrated with the framework that underlies the Personal Cancer Genome Reporter (PCGR), utilizing the Docker environment for encapsulation of code and software dependencies. While PCGR is intended for reporting and analysis of somatic variants detected in a tumor, CPSR is intended for reporting and ranking of germline variants in protein-coding genes that are implicated in cancer predisposition and inherited cancer syndromes.

CPSR accepts a query file with raw germline variant calls encoded in the VCF format (i.e. SNVs/InDels). Furthermore, through the use of several different virtual cancer predisposition gene panels harvested from the Genomics England PanelApp, the user can flexibly restrict which genes and findings are displayed in the cancer predisposition report.

The software performs extensive variant annotation on the selected geneset and produces an interactive HTML report, in which the user can investigate:

ClinVar variants - pre-classified variants according to a five-level tier scheme in ClinVar (Pathogenic to Benign)
Non-ClinVar variants - classified by CPSR through ACMG criteria (variant frequency levels and functional effects) into a five-level tier scheme (Pathogenic to Benign)
Genomic biomarkers - cancer predisposition variants with reported implications for prognosis, diagnosis or therapeutic regimens
Secondary findings (optional) - pathogenic ClinVar variants in the ACMG recommended list for reporting of incidental findings
GWAS hits (optional) - variants overlapping with previously identified hits in genome-wide association studies (GWAS) of cancer phenotypes (i.e. low to moderate risk conferring alleles), using the NHGRI-EBI Catalog of published genome-wide association studies as the underlying source

The variant sets can be interactively explored and filtered further through different types of filters (phenotypes, genes, variant consequences, population MAF etc.). Importantly, the unclassified non-ClinVar variants are assigned a pathogenicity score based on the aggregation of scores according to previously established ACMG criteria. The ACMG criteria include cancer-specific criteria, as outlined and specified in several previous studies (Huang et al., Cell, 2018; Nykamp et al., Genet Med., 2017; Maxwell et al., Am J Hum Genet., 2016; Amendola et al., Am J Hum Genet., 2016). See also Related work below.
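The score aggregation described above can be illustrated with a minimal sketch. Everything numeric here is an assumption made for illustration: the weights per criterion and the tier cut-offs are placeholders, not the values CPSR uses, and the function names are hypothetical. Only the criterion codes (PVS1, PM2, PP3, BA1, BS1, BP4) are standard ACMG identifiers.

```python
# Illustrative sketch only: weights and cut-offs are made-up placeholders,
# NOT the values used by CPSR.
EXAMPLE_WEIGHTS = {
    "PVS1": 8.0,   # pathogenic evidence, very strong
    "PM2": 2.0,    # pathogenic evidence, moderate
    "PP3": 1.0,    # pathogenic evidence, supporting
    "BA1": -8.0,   # benign evidence, stand-alone
    "BS1": -4.0,   # benign evidence, strong
    "BP4": -1.0,   # benign evidence, supporting
}

# (lower bound, tier label), checked from the most pathogenic tier downwards.
EXAMPLE_TIERS = [
    (5.0, "Pathogenic"),
    (2.5, "Likely Pathogenic"),
    (-1.0, "VUS"),
    (-4.5, "Likely Benign"),
]


def aggregate_score(criteria):
    """Sum the (placeholder) weights of all criteria met by a variant."""
    return sum(EXAMPLE_WEIGHTS[code] for code in criteria)


def assign_tier(score):
    """Map an aggregated score onto a five-level tier scheme."""
    for lower_bound, label in EXAMPLE_TIERS:
        if score >= lower_bound:
            return label
    return "Benign"


# Usage: a variant meeting one very strong and one moderate pathogenic criterion.
example_tier = assign_tier(aggregate_score(["PVS1", "PM2"]))
```

A variant that meets no criteria gets a score of zero and lands in the middle tier under these placeholder cut-offs, which mirrors the idea that a variant without evidence in either direction stays of uncertain significance.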

Cancer predisposition genes

The cancer predisposition report can show variants found in a number of well-known cancer predisposition genes, and the specific set of genes can be customized by the user by choosing any of the following virtual gene panels (0 - 38):

- Panel 0 (default) is a comprehensive, research-based gene panel assembled from known sources on cancer predisposition:
  - A list of 152 genes that were curated and established within TCGA’s pan-cancer study (Huang et al., Cell, 2018)
  - A list of 107 protein-coding genes that have been manually curated in COSMIC’s Cancer Gene Census v90
  - A list of 148 protein-coding genes established by experts within the Norwegian Cancer Genomics Consortium (http://cancergenomics.no)
  The combination of the three sources resulted in a non-redundant set of 213 protein-coding genes of relevance for predisposition to tumor development.
- Panels 1 - 38 are panels for inherited cancer syndromes and cancer predisposition assembled within the Genomics England PanelApp (the individual panels are listed under --panel_id in STEP 3).
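The way panel 0 merges three partly overlapping gene lists into one non-redundant set can be sketched as a set union. The gene symbols and their assignment to each source below are toy placeholders chosen for illustration, not the actual contents of the three lists, and `combine_panels` is a hypothetical helper name.

```python
# Toy stand-ins for the three sources; the memberships are NOT the real lists.
tcga_genes = {"BRCA1", "BRCA2", "TP53", "MLH1"}
cgc_genes = {"BRCA1", "TP53", "APC"}
ncgc_genes = {"BRCA2", "APC", "PTEN"}


def combine_panels(*sources):
    """Return the sorted, non-redundant union of several gene lists."""
    combined = set()
    for source in sources:
        combined |= set(source)
    return sorted(combined)


panel_0 = combine_panels(tcga_genes, cgc_genes, ncgc_genes)
# Genes shared between sources are counted once, so the union is smaller
# than the sum of the individual list sizes (here 6 rather than 10).
```

The same effect explains why 152, 107 and 148 genes combine into 213 rather than 407: genes that appear in more than one source are kept only once.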

Example report

Cancer predisposition sequencing report

Annotation resources included in cpsr - 0.5.2

VEP - Variant Effect Predictor v98.3 (GENCODE v31/v19 as the gene reference dataset), includes gnomAD r2.1, dbSNP build 152/152, 1000 Genomes Project - phase3
ClinVar - Database of variants with clinical significance (November 2019)
CIViC - clinical interpretations of variants in cancer (November 5th 2019)
Cancer Hotspots - Resource for statistically significant mutations in cancer (v2 - 2017)
dbNSFP - Database of non-synonymous functional predictions (v4.0, May 2019)
UniProt/SwissProt KnowledgeBase - Resource on protein sequence and functional information (2019_10, November 2019)
Pfam - Database of protein families and domains (v32, Sep 2018)
CancerMine - Literature-derived database of tumor suppressor genes/proto-oncogenes (v18, November 2019)
GenomicsEngland PanelApp - panels as of November 16th 2019
NHGRI-EBI GWAS catalog - GWAS catalog for cancer phenotypes (October 14th 2019)

Documentation

IMPORTANT: If you use CPSR, please cite the following preprint:

Sigve Nakken, Vladislav Saveliev, Oliver Hofmann, Pål Møller, Ola Myklebost, and Eivind Hovig. Cancer Predisposition Sequencing Reporter: a flexible variant report engine for germline screening in cancer (2019). bioRxiv. doi:10.1101/846089

News

November 18th 2019: 0.5.2 release
- Updated bundle (ClinVar, CancerMine, UniProtKB, Genomics England PanelApp)
- CHANGELOG
October 13th 2019: 0.5.1 release
- Updated software (VEP 98.2)
- Updated bundle (ClinVar, CancerMine, Genomics England PanelApp (36 panels))
October 4th 2019:
- By mistake, the recently updated grch38 data bundle (20190927) is missing a critical part for CPSR processing. Please download the missing files using this link, and put the contents underneath data/grch38/gnomad_cpsr/

Getting started

STEP 0: Install PCGR (version 0.8.4)

Make sure you have a working installation of PCGR (version 0.8.4) and the accompanying data bundle(s) (walk through steps 0-2).

STEP 1: Download the latest release

Download the 0.5.2 release of cpsr (run script and configuration file)

STEP 2: Configuration

A few elements of the workflow can be configured using the cpsr configuration file, encoded in TOML. The following can be configured:

Choice of gnomAD control population
Upper MAF limit for unclassified variants to be included in the report
Inclusion of GWAS hits
Inclusion of CPSR-based ACMG classifications for ClinVar variants
Inclusion of secondary findings
VEP/vcfanno options

See the section on Input for more details regarding the default configuration.
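As a rough illustration of how the upper MAF limit could act on the report, the sketch below applies a threshold to a small list of variants. The dictionary stands in for settings read from the TOML file. All key names, the threshold value, and the variant records are hypothetical and do not reflect the actual layout of cpsr.toml. The sketch also assumes that the limit applies only to variants without a ClinVar classification.

```python
# Hypothetical settings, standing in for values parsed from the configuration
# file. Key names and values are assumptions, not the real cpsr.toml layout.
example_config = {
    "maf_upper_threshold": 0.05,
    "gwas_findings": True,
    "secondary_findings": False,
}

# Made-up variant records for illustration.
example_variants = [
    {"id": "var1", "classified_in_clinvar": False, "maf": 0.001},
    {"id": "var2", "classified_in_clinvar": False, "maf": 0.20},
    {"id": "var3", "classified_in_clinvar": True, "maf": 0.20},
]


def variants_for_report(variants, config):
    """Keep ClinVar-classified variants, plus unclassified variants whose
    population MAF does not exceed the configured upper limit."""
    limit = config["maf_upper_threshold"]
    return [
        variant
        for variant in variants
        if variant["classified_in_clinvar"] or variant["maf"] <= limit
    ]


kept_ids = [v["id"] for v in variants_for_report(example_variants, example_config)]
```

Under these assumptions the common unclassified variant is dropped, while the rare unclassified variant and the ClinVar-classified variant are retained.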

STEP 3: Run example

Cancer Predisposition Sequencing Reporter (CPSR) - report of cancer-predisposing germline variants

positional arguments:
query_vcf             VCF input file with germline query variants (SNVs/InDels).
pcgr_base_dir         Directory that contains the PCGR data bundle directory, e.g. ~/pcgr-0.8.4
output_dir            Output directory
{grch37,grch38}       Genome assembly build: grch37 or grch38
configuration_file    Configuration file (TOML format)
sample_id             Sample identifier - prefix for output files

optional arguments:
-h, --help            show this help message and exit
--force_overwrite     By default, the script will fail with an error if any output file already exists.
				You can force the overwrite of existing result files by using this flag
--version             show program's version number and exit
--basic               Run functional variant annotation on VCF through VEP/vcfanno, omit report generation (STEP 4)
--panel_id VIRTUAL_PANEL_ID
			    Identifier for choice of predefined virtual cancer predisposition gene panels,
				choose any between the following identifiers:
			    0 = CPSR exploratory cancer predisposition panel (n = 213, TCGA + Cancer Gene Census + NCGC)
			    1 = Adult solid tumours cancer susceptibility (Genomics England PanelApp)
			    2 = Adult solid tumours for rare disease (Genomics England PanelApp)
			    3 = Bladder cancer pertinent cancer susceptibility (Genomics England PanelApp)
			    4 = Brain cancer pertinent cancer susceptibility (Genomics England PanelApp)
			    5 = Breast cancer pertinent cancer susceptibility (Genomics England PanelApp)
			    6 = Childhood solid tumours cancer susceptibility (Genomics England PanelApp)
			    7 = Colorectal cancer pertinent cancer susceptibility (Genomics England PanelApp)
			    8 = Endometrial cancer pertinent cancer susceptibility (Genomics England PanelApp)
			    9 = Familial Tumours Syndromes of the central & peripheral Nervous system (Genomics England PanelApp)
			    10 = Familial breast cancer (Genomics England PanelApp)
			    11 = Familial melanoma (Genomics England PanelApp)
			    12 = Familial prostate cancer (Genomics England PanelApp)
			    13 = Familial rhabdomyosarcoma (Genomics England PanelApp)
			    14 = GI tract tumours (Genomics England PanelApp)
			    15 = Genodermatoses with malignancies (Genomics England PanelApp)
			    16 = Haematological malignancies cancer susceptibility (Genomics England PanelApp)
			    17 = Haematological malignancies for rare disease (Genomics England PanelApp)
			    18 = Head and neck cancer pertinent cancer susceptibility (Genomics England PanelApp)
			    19 = Inherited non-medullary thyroid cancer (Genomics England PanelApp)
			    20 = Inherited ovarian cancer (without breast cancer) (Genomics England PanelApp)
			    21 = Inherited pancreatic cancer (Genomics England PanelApp)
			    22 = Inherited renal cancer (Genomics England PanelApp)
			    23 = Inherited phaeochromocytoma and paraganglioma (Genomics England PanelApp)
			    24 = Melanoma pertinent cancer susceptibility (Genomics England PanelApp)
			    25 = Multiple endocrine tumours (Genomics England PanelApp)
			    26 = Multiple monogenic benign skin tumours (Genomics England PanelApp)
			    27 = Neuroendocrine cancer pertinent cancer susceptibility (Genomics England PanelApp)
			    28 = Neurofibromatosis Type 1 (Genomics England PanelApp)
			    29 = Ovarian cancer pertinent cancer susceptibility (Genomics England PanelApp)
			    30 = Parathyroid Cancer (Genomics England PanelApp)
			    31 = Prostate cancer pertinent cancer susceptibility (Genomics England PanelApp)
			    32 = Renal cancer pertinent cancer susceptibility (Genomics England PanelApp)
			    33 = Rhabdoid tumour predisposition (Genomics England PanelApp)
			    34 = Sarcoma cancer susceptibility (Genomics England PanelApp)
			    35 = Sarcoma susceptibility (Genomics England PanelApp)
			    36 = Thyroid cancer pertinent cancer susceptibility (Genomics England PanelApp)
			    37 = Tumour predisposition - childhood onset (Genomics England PanelApp)
			    38 = Upper gastrointestinal cancer pertinent cancer susceptibility (Genomics England PanelApp)

--custom_panel TARGET_BED
			    Define custom screening panel through a three-column BED file (alternative to predefined panels provided with --panel_id)
--no_vcf_validate     Skip validation of input VCF with Ensembl's vcf-validator
--diagnostic_grade_only
			    For Genomics England virtual predisposition panels - consider genes with a GREEN status only
--docker-uid DOCKER_USER_ID
			    Docker user ID. Default is the host system user ID. If you are experiencing permission errors,
				try setting this up to root (`--docker-uid root`)
--no-docker           Run the CPSR workflow in a non-Docker mode (see install_no_docker/ folder for instructions
--debug               Print full docker commands to log

The cpsr software bundle contains an example VCF file. It also contains a configuration file (cpsr.toml).

Report generation with the example VCF, using the Adult solid tumours cancer susceptibility virtual gene panel, can be performed through the following command:

python ~/cpsr-0.5.2/cpsr.py ~/cpsr-0.5.2/example.vcf.gz ~/pcgr-0.8.4 ~/cpsr-0.5.2 grch37 --panel_id 1 ~/cpsr-0.5.2/cpsr.toml example

Note that the example command also refers to the PCGR directory (pcgr-0.8.4), which contains the data bundle that is necessary for both PCGR and CPSR.
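When CPSR is run for many samples, the command line can be assembled programmatically. The sketch below builds the argument list in the positional order shown in the help text (query VCF, PCGR directory, output directory, genome assembly, configuration file, sample identifier) and only uses options listed there. The function name `build_cpsr_command` and the example paths are hypothetical. The function only constructs the command and does not run it.

```python
def build_cpsr_command(cpsr_dir, pcgr_dir, query_vcf, output_dir,
                       genome_assembly, config_file, sample_id,
                       panel_id=0, force_overwrite=False):
    """Assemble the cpsr.py argument list (hypothetical helper)."""
    if genome_assembly not in ("grch37", "grch38"):
        raise ValueError("genome_assembly must be 'grch37' or 'grch38'")
    if not 0 <= panel_id <= 38:
        raise ValueError("panel_id must be between 0 and 38")

    command = [
        "python", f"{cpsr_dir}/cpsr.py",
        query_vcf, pcgr_dir, output_dir,
        genome_assembly, config_file, sample_id,
        "--panel_id", str(panel_id),
    ]
    if force_overwrite:
        command.append("--force_overwrite")
    return command


# Usage: the equivalent of the example run, with placeholder absolute paths.
example_command = build_cpsr_command(
    cpsr_dir="/opt/cpsr-0.5.2",
    pcgr_dir="/opt/pcgr-0.8.4",
    query_vcf="/opt/cpsr-0.5.2/example.vcf.gz",
    output_dir="/opt/cpsr-0.5.2",
    genome_assembly="grch37",
    config_file="/opt/cpsr-0.5.2/cpsr.toml",
    sample_id="example",
    panel_id=1,
)
```

The resulting list can be handed to `subprocess.run(example_command, check=True)`. Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.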

This command will run the Docker-based cpsr workflow and produce the following output files in the cpsr folder:

example.cpsr.grch37.pass.vcf.gz (.tbi) - Bgzipped VCF file with relevant annotations appended by CPSR
example.cpsr.grch37.pass.tsv.gz - Compressed TSV file (generated with vcf2tsv) of VCF content with relevant annotations appended by CPSR
example.cpsr.grch37.html - Interactive HTML report with clinically relevant variants in cancer predisposition genes organized into tiers
example.cpsr.grch37.json.gz - Compressed JSON dump of HTML report content
example.cpsr.snvs_indels.tiers.grch37.tsv - TSV file with most important annotations of tier-structured SNVs/InDels
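The TSV outputs can be loaded with the Python standard library for downstream filtering. The following is a minimal sketch: `read_annotated_tsv` is a hypothetical helper, and skipping lines that start with `#` is an assumption about possible comment lines ahead of the column header. Column names depend on the actual file and are not listed here.

```python
import csv
import gzip


def read_annotated_tsv(path):
    """Read a (optionally gzip-compressed) tab-separated file into a list
    of dictionaries, one per row, keyed by column name."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        # Assumption: any lines starting with '#' are comments, not data.
        data_lines = (line for line in handle if not line.startswith("#"))
        return list(csv.DictReader(data_lines, delimiter="\t"))


# Usage (file name taken from the output list above):
# rows = read_annotated_tsv("example.cpsr.grch37.pass.tsv.gz")
# print(len(rows), "annotated variants")
```

The same function handles both the compressed `.tsv.gz` file and the uncompressed tier TSV, since the opener is chosen from the file extension.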

Related work

Contact

sigven AT ifi.uio.no
