IBDVar

A tool for prioritising identity-by-descent (IBD) variants in Whole Genome Sequencing (WGS) data from families with rare heritable diseases. IBDVar consists of a command-line variant prioritisation pipeline and an interactive Shiny dashboard for starting the pipeline and visualising its output.

Overview

The use of IBDVar follows a three-step process:

The prioritisation pipeline is composed of two sub-pipelines, one for short variants and one for structural variants (SV), which are started independently. Users can either upload a multi-sample VCF file and configure the chosen pipeline in the Shiny dashboard, or run the pipeline on a multi-sample VCF file at the command line using a configuration file. Once the pipeline has completed, the output can be explored interactively in the corresponding pipeline tab of the Shiny dashboard. A distinguishing feature of the tool is the integration of IBD segment detection into variant prioritisation for WGS data. An overview of the key steps is shown below.

System Requirements

For running the bash pipeline backend:

For deploying the Shiny dashboard, the following R dependencies are required:

  • shiny
  • shinydashboard
  • shinyFiles
  • shinyjs
  • htmlwidgets
  • dplyr
  • jsonlite
  • purrr
  • readxl
  • DT
  • ideogram
  • reshape2

To install these R packages, type the following in an R console:

install.packages(c("shiny", "shinydashboard", "shinyFiles", "shinyjs", "htmlwidgets", "dplyr", "jsonlite", "purrr", "readxl", "DT", "reshape2"))

To install the ideogram library, locate the ideogram tarball file (.tar.gz) and pass its path to install.packages:

install.packages("path/to/ideogram_0.0.0.9000.tar.gz", type="source", repos=NULL)

Variant Prioritisation Pipelines

IBDVar can prioritise both short variants and structural variants (SV) from multi-sample VCF files generated from the Illumina DRAGEN Pipeline. Both prioritisation pipelines can be initiated from the command-line or inside the Shiny dashboard "Start pipeline" tab.

Short Variants (Command line start)

Input VCF file

A multi-sample VCF file containing short variants (indels/SNPs) called by the Illumina DRAGEN pipeline is used as input (see the Illumina website for details). The VCF file should adhere to the version 4.2 specification. The pipeline expects chromosome names to be prefixed with "chr"; however, the tool checks that naming is consistent between the input VCF and the annotation resources used by the pipeline.
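To see which chromosome naming style a VCF uses before starting the pipeline, a quick check along the following lines may help. This is a minimal sketch and not part of IBDVar: the function name vcf_chr_style is hypothetical, and it only inspects the first column of the data lines.

```shell
# Report the chromosome naming style found in the data lines of a VCF.
# Prints "chr" if every contig name starts with "chr", "no_chr" if none do,
# "mixed" if both styles occur, and "empty" if there are no data lines.
# Plain and gzip-compressed (.gz) files are both accepted.
vcf_chr_style() {
    vcf="$1"
    reader="cat"
    case "$vcf" in
        *.gz) reader="gzip -dc" ;;
    esac
    $reader "$vcf" | awk -F'\t' '
        /^#/        { next }               # skip meta and header lines
        $1 ~ /^chr/ { n_chr++; next }      # contig name with "chr" prefix
                    { n_plain++ }          # contig name without the prefix
        END {
            if (n_chr && !n_plain)      print "chr"
            else if (!n_chr && n_plain) print "no_chr"
            else if (n_chr && n_plain)  print "mixed"
            else                        print "empty"
        }'
}
```

Calling vcf_chr_style on the input file prints a single word, so the result can be compared against the style used by the annotation resources.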

Configuration Parameters

To run the short variants pipeline at the command line, you will need to create a configuration file with parameters (with "=" separating the parameter and its value) described in the table below:

| Category | Configuration parameter | Description |
| --- | --- | --- |
| General settings | in_vcf | Input file path for the small variants VCF produced by the Illumina DRAGEN Germline Pipeline |
| | out_dir | Output directory path in which pipeline output is generated |
| | threads | Number of threads (CPU) for executing the pipeline (default: 4) |
| QC filtering | GQ | Minimum genotype quality threshold for each sample (default: 20) |
| | DP | Minimum (FORMAT) read depth threshold per sample (default: 10) |
| | MAF | Minimum allele frequency for variants to be selected for the PLINK dataset |
| IBD detection | mind | Maximum proportion of missing genotype data per sample, e.g. 0.1 excludes samples with > 10% missing genotype data (default: 0.1) |
| | geno | Keep variants with a missing call rate lower than the provided value (default: 0.1) |
| | max_af | Maximum allele frequency threshold for rare variants in the gnomAD, ESP or 1000 Genomes Project populations (default: 0.05) |
| | ibis_mt1 | Minimum number of markers for IBIS to call a segment IBD1 |
| | ibis_mt2 | Minimum number of markers for IBIS to call a segment IBD2 |
| | genes | A list of genes of interest for selecting variants in the specified genes (optional) |
| Tools | tools_dir | Base directory path for tools required by the pipeline (optional) |
| | plink | PLINK2 directory path |
| | vep | VEP executable file path |
| | ibis | IBIS directory path |
| Resources | resources | Base directory path for resources (optional) |
| | clinvar | ClinVar VCF file path |
| | genetic_map | File path for the genetic recombination map of the human genome |
| | cadd | CADD plugin resource directory path |
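To make the GQ and DP thresholds concrete, the sketch below applies them to an uncompressed VCF with awk. It is an illustration only and not the pipeline's own filter: the function name qc_filter_vcf is hypothetical, and it assumes a record is dropped when any sample falls below either threshold, whereas the pipeline may treat failing genotypes differently (for example, by setting them to missing).

```shell
# Keep VCF records in which every sample meets the minimum GQ and DP.
# Usage: qc_filter_vcf <vcf> <min_gq> <min_dp>
# Assumption: one failing sample is enough to drop the whole record.
qc_filter_vcf() {
    awk -F'\t' -v min_gq="$2" -v min_dp="$3" '
        /^#/ { print; next }                   # pass header lines through
        {
            n = split($9, keys, ":")           # FORMAT column lists the field order
            gq = 0; dp = 0
            for (i = 1; i <= n; i++) {
                if (keys[i] == "GQ") gq = i
                if (keys[i] == "DP") dp = i
            }
            if (!gq || !dp) next               # cannot assess without both fields
            for (s = 10; s <= NF; s++) {       # sample columns start at column 10
                split($s, vals, ":")
                if (vals[gq] == "." || vals[gq] + 0 < min_gq) next
                if (vals[dp] == "." || vals[dp] + 0 < min_dp) next
            }
            print
        }' "$1"
}
```

With the defaults from the table, the call would be qc_filter_vcf input.vcf 20 10, where input.vcf is a placeholder file name.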

Click here for an example of a short variants config file.
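As a rough illustration of the "parameter=value" layout described above, the following sketch writes a config file and reads a value back from it. Every path is a placeholder, the helper names write_example_config and config_get are hypothetical, and the file is deliberately incomplete: MAF, ibis_mt1, ibis_mt2 and genes are left out because no default values are documented for them, so they would need to be added before a real run.

```shell
# Write an illustrative short variants config to the path given as $1.
# All paths below are placeholders; replace them with real locations.
write_example_config() {
    cat > "$1" <<'EOF'
in_vcf=/path/to/family.short_variants.vcf.gz
out_dir=/path/to/output_dir
threads=4
GQ=20
DP=10
mind=0.1
geno=0.1
max_af=0.05
plink=/path/to/tools/plink2
vep=/path/to/tools/vep/vep
ibis=/path/to/tools/ibis
clinvar=/path/to/resources/clinvar.vcf.gz
genetic_map=/path/to/resources/genetic_map.txt
cadd=/path/to/resources/cadd
EOF
}

# Print the value stored for key $2 in config file $1 (first match wins).
# Prints nothing if the key is absent.
config_get() {
    awk -F'=' -v key="$2" '$1 == key { sub(/^[^=]*=/, ""); print; exit }' "$1"
}
```

For example, after write_example_config pipeline.config, the command config_get pipeline.config threads prints the configured thread count, which is a convenient way to confirm a setting before starting a long run.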

Using a screen to run the short variants pipeline

As the short variants pipeline can take a few hours to complete, it is highly recommended to run it inside a GNU Screen session so that it is not terminated abruptly, for example by a dropped connection or a sudden end to the SSH session. To install GNU Screen on Ubuntu/Debian systems:

sudo apt update
sudo apt install screen 

On CentOS/Fedora type:

sudo yum install screen

To create a screen session, type screen in the terminal, or create a named session by typing the following:

screen -S <screen_name>

Attach the screen to the terminal as follows:

screen -r <screen_name>

Once the screen is attached execute the pipeline as described in the usage section below.

After attaching the screen to the terminal and starting the short variants pipeline, detach the screen by pressing Ctrl+A followed by d, or by typing the following in another terminal:

screen -d <screen_name>

This will allow exiting of the terminal window without terminating the pipeline. To reattach the screen, simply type in the terminal:

screen -r <screen_name>

Usage

./short_variants.sh -c pipeline.config [-m in_vcf.md5sum ]
Options:
  • -c: config file (ending with .config) containing all parameters to execute the pipeline (required)
  • -m: md5sum file used to perform an md5sum check on the input VCF file specified in the config file (optional)
  • -h: help message with usage details and options
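The sketch below shows one way to produce and verify a checksum file for the input VCF with GNU coreutils md5sum. It assumes, without confirmation from the pipeline source, that the -m option accepts the standard "checksum  filename" format written by md5sum; the helper names make_md5_file and verify_md5_file are hypothetical.

```shell
# Write "<vcf>.md5sum" next to the VCF given as $1.
# The checksum is recorded against the bare file name, so the check
# must later be run from the same directory (verify_md5_file does this).
make_md5_file() {
    ( cd "$(dirname "$1")" && md5sum "$(basename "$1")" > "$(basename "$1").md5sum" )
}

# Verify the VCF given as $1 against "<vcf>.md5sum".
# Returns 0 when the checksum matches and non-zero otherwise.
verify_md5_file() {
    ( cd "$(dirname "$1")" && md5sum -c --quiet "$(basename "$1").md5sum" )
}
```

Running verify_md5_file on the VCF after a transfer gives an early indication of a truncated or corrupted file before the pipeline is started.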

Structural Variants (Command line start)

Input VCF File

A multi-sample VCF file containing structural variants called (using Manta) by the Illumina DRAGEN pipeline is used as input (see the Illumina website for details). The VCF file should adhere to the version 4.2 specification. The pipeline expects chromosome names to be prefixed with "chr"; however, the tool checks that naming is consistent between the input VCF and the annotation resources used by the pipeline.

Configuration File

To start the structural variants pipeline at the command line, you will need to create a configuration file using the parameters specified in the table below:

| Category | Configuration parameter | Description |
| --- | --- | --- |
| General settings | sv_vcf | Input VCF file path |
| | out_dir | Directory path for pipeline output |
| | threads | Number of threads (CPU) |
| Variant selection | ibd_seg | IBD segment file path (from the short variants pipeline) for selecting SV in IBD segments |
| | genes | A list of genes of interest to be used to filter variants |
| Tools | tools_dir | Base directory for tools (optional) |
| Resources | resources | Base directory for resources (optional) |
| | ccds | CCDS directory path |

Click here for an example of a structural variants config file.
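Because the structural variants pipeline depends on the IBD segment file produced by the short variants pipeline, it can be useful to confirm that file exists before writing the config. The following is a minimal sketch under stated assumptions: the helper name write_sv_config is hypothetical, the thread count of 4 is illustrative (no default is documented for this pipeline), and the optional or list-valued parameters (genes, tools_dir, resources) are omitted.

```shell
# Write a structural variants config after checking the IBD segment file.
# Usage: write_sv_config <config_out> <sv_vcf> <ibd_seg> <out_dir> <ccds_dir>
# Returns 1 without writing anything if the IBD segment file is missing or empty.
write_sv_config() {
    if [ ! -s "$3" ]; then
        echo "IBD segment file not found or empty: $3" >&2
        return 1
    fi
    cat > "$1" <<EOF
sv_vcf=$2
out_dir=$4
threads=4
ibd_seg=$3
ccds=$5
EOF
}
```

The resulting file can then be passed to the pipeline with the -c option once any remaining parameters have been added.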

Usage

./structural_variants.sh -c pipeline.config
Options:
  • -c: config file (ending with .config) containing all parameters to execute the pipeline (required)
  • -h: help message with usage details and options

Shiny Dashboard

The Shiny dashboard allows users to start prioritisation pipelines for short or structural variants and to analyse the output interactively.

To start the Shiny dashboard on the Cranfield University server, log into the Linux server deploying the tool and enter the application URL (available on request from the author) in a web browser. (Note that development and testing were performed using the Google Chrome browser, so performance may vary with other browsers.) The Shiny dashboard can also be started in RStudio; however, this is not recommended, since most of the views have been configured for browser display and running it in RStudio may affect the performance of the tool.

The "Start Pipeline" tab will be open first by default.

Start Pipeline

In the "Start Pipeline" tab you can start the short variants or structural variants pipeline by selecting an input VCF file, output folder for results and configuring parameters listed in the respective pipeline box.

Once parameters have been specified, click Start in the respective pipeline box to run the pipeline. A notification message should appear in the bottom right corner indicating pipeline initiation.

Short Variants

In the "Short Variants" tab you can explore the short variants pipeline output interactively.
The tab features:

  • a "Files" box to upload the following files which are located in the "final_output" folder of the output folder specified at run-time of the pipeline:

    1. A prioritised and annotated list of variants produced by the short variants prioritisation pipeline.
    2. An IBIS IBD segment file produced by the pipeline.
    3. (Optional) A file containing a list of genes of interest, used to filter the variants by these genes.
  • Interactive variants table - users can filter, sort, search and download a TSV file of variants reported in the table.

  • Filters panel - contains a series of checkboxes to filter variants by CADD score, predicted consequence, SIFT and PolyPhen calls, clinical significance (ClinVar) and VEP predicted impact (loss of function etc.)

  • Interactive ideogram - filters variants in the interactive data table below by the IBD region clicked by the user. A tool-tip reporting the chromosome number, start and end position of a given IBD region is displayed when a user hovers over an IBD region.

  • "Summary" box summarising:

    • total number of variants
    • number of pathogenic variants identified by ClinVar
    • number of detected IBD segments
    • total number of deleterious missense variants predicted by SIFT, PolyPhen and CADD
    • number of loss of function variants

Structural Variants

In the "Structural Variants" tab, the prioritised SV calls from the pipeline can be interactively explored using filters and an interactive data table.

SV tab features include:

  • "Files" box for uploading the prioritised list of SV calls (a .tsv file)
  • Interactive table of variants that can be filtered, sorted, searched and downloaded as a TSV file.
  • "Summary" tab providing summary statistics on the various counts of SV types and also the mean SV lengths.
  • Filters panel containing checkboxes to filter the variants table by: SV type, chromosome number, precision of breakpoints of called SVs and genes of interest.

Questions, Feature Requests, Bug Reports and Issues

For any questions, feature requests, bug reports or issues regarding the latest version of IBDVar, please open an issue via the "Issues" tab at the top left of the GitHub repository page.

Licence

MIT

Collaborators

This codebase was developed as part of an MSc thesis project (MSc Applied Bioinformatics, Cranfield University 2021-2022) under the supervision of Dr Alexey Larionov.