De novo SNV analysis pipeline

Primary language: R. License: MIT.

This pipeline identifies, characterizes and prioritizes de novo autosomal SNVs in whole genome sequencing (WGS) and RNA-sequencing data from patients with intellectual disability (ID) and/or congenital malformations. It focuses on non-exonic SNVs. Raw WGS data must be processed with the UMCU IAP pipeline before it can be used in this pipeline, and raw RNA-sequencing data must be processed with the UMCU RNASeq pipeline. For questions, please feel free to contact me at: freek.manders@gmail.com. This pipeline was developed during my master's internship in the Cuppen group at the UMC Utrecht.
Note: This pipeline has many dependencies, which makes it difficult to run as is. However, some of the strategies and code used in this pipeline can be useful for people working on non-exonic SNVs.

USAGE:

This pipeline works on the GRCh37 genome assembly. To run the pipeline:

python Denovo_snv_analysis.py

Settings can be changed in the Denovo_snv_analysis.ini file. To run the pipeline with a different ini file:

python Denovo_snv_analysis.py --ini myini.ini
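
As a rough illustration of how an `--ini` option of this kind is commonly handled, the sketch below parses the option with argparse and reads ini-formatted settings with configparser. It is not taken from Denovo_snv_analysis.py: the section name `general` and the key `output_dir` are hypothetical, and the real ini file may be organised differently. The sketch assumes Python 3 (under Python 2.7 the module is called ConfigParser).

```python
import argparse
import configparser


def parse_args(argv=None):
    """Parse the command line; --ini falls back to the default ini file name."""
    parser = argparse.ArgumentParser(description="Sketch of ini handling")
    parser.add_argument("--ini", default="Denovo_snv_analysis.ini",
                        help="Path to the settings file")
    return parser.parse_args(argv)


def load_settings(ini_text):
    """Read ini-formatted text into a nested dict: {section: {key: value}}."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    return {section: dict(config[section]) for section in config.sections()}


# Hypothetical ini content, for illustration only.
EXAMPLE_INI = """
[general]
output_dir = /path/to/output
"""

# Equivalent to running: python Denovo_snv_analysis.py --ini myini.ini
args = parse_args(["--ini", "myini.ini"])
settings = load_settings(EXAMPLE_INI)
```

Keeping all paths and thresholds in the ini file means a run can be reproduced by archiving that single file next to the output.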

The inputs and outputs of the different scripts are visualized in overview graphs.
This pipeline is designed to run on a Sun Grid Engine system. To run the pipeline on a different system, you need to change the Denovo_snv_analysis.py script.
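
To give an idea of what would need to change on another scheduler, here is a minimal sketch of how a pipeline step might be submitted to Sun Grid Engine from Python. This is an assumption about the general approach, not the code used in Denovo_snv_analysis.py. The function names, job names and script names are hypothetical; only the `qsub` options `-N`, `-cwd` and `-hold_jid` are standard Sun Grid Engine flags.

```python
import subprocess


def build_qsub_command(script, job_name, hold_jid=None):
    """Return the qsub argument list for one pipeline step (hypothetical helper)."""
    # -N sets the job name, -cwd runs the job from the current directory.
    cmd = ["qsub", "-N", job_name, "-cwd"]
    if hold_jid:
        # -hold_jid makes this job wait until the named job has finished.
        cmd += ["-hold_jid", hold_jid]
    cmd.append(script)
    return cmd


def submit(cmd, dry_run=True):
    """Submit the job, or only return the command line when dry_run is True."""
    if dry_run:
        return " ".join(cmd)
    return subprocess.check_output(cmd)


# Hypothetical step that waits for an earlier job called "filter_snvs".
command = build_qsub_command("annotate.sh", "annotate_snvs", hold_jid="filter_snvs")
preview = submit(command, dry_run=True)
```

On a different scheduler, the command-building function is the part that would have to be replaced, for example with the equivalent submission command of that system.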

Dependencies:

Core tools

  • Sun Grid Engine
  • Python >= 2.7.10
  • R >= 3.4.1 (Bioconductor >= 3.6)

Bio tools loaded as modules in the Sun Grid Engine

Other Bio tools

Python packages

  • argparse
  • os
  • sys
  • re
  • datetime
  • timeit
  • numpy
  • pandas
  • multiprocessing
  • subprocess
  • signal
  • collections
  • shutil

R packages

  • biomaRt
  • BSgenome.Hsapiens.UCSC.hg19
  • cba
  • circlize
  • ComplexHeatmap
  • dplyr
  • gdata
  • GenomicRanges
  • ggplot2
  • ggpubr
  • ggrepel
  • gridExtra
  • Gviz
  • karyoploteR
  • MutationalPatterns
  • optparse
  • plyr
  • RColorBrewer
  • reshape2
  • rtracklayer
  • scales
  • stringr
  • VariantAnnotation
  • VennDiagram

Datasets used when running the pipeline. Some of these may require pre-processing:

  • GRCh37
  • DANN scores
  • gnomAD
  • HMF-PON (Contact the HMF for access to the data. Tested on version 2.0)
  • ROADMAP 15-state model (The coreMarks mnemonics.bed files for samples: E029, E070, E071, E081 and E082 are needed. Tested on the version from 11-10-2013)
  • Ensembl Regulatory Build (Download with BioMart as: Ensembl Regulation: Human Regulatory Features. Tested on version 92. Make sure to download the version for GRCh37)
  • phastCons46way (Download the .wigFix.gz files. Tested on the version from 10-11-2009)
  • Age_parents (A tab delimited file with the format: sample_id, age_father, age_mother)
  • Breakpoints (A file containing breakpoints of the SVs identified in this sample. For scripts to identify these SVs please contact me.)
  • Cancer signatures (Tested on the version from 25-04-2018)
  • Output data from Alissa Interpret (This is only necessary when comparing the results of the pipeline to Cartagenia)
  • A gene list generated by a script from the Cuppen group. Please contact me if you wish to use this script.
  • An Excel file containing the phenotypes of the patient. The first column should contain the patient's ID, the third column should contain the phenotypes as HPO terms separated by a ",".
  • Genes_to_phenotype (Tested on the all_sources_all_frequencies version from 11-12-2017.)
  • HPO OBO (Tested on the version from 11-12-2017.)
  • Entrez gene list
  • PCHiC data. Tested with the datasets from Javierre BM et al., Paula Freire-Pritchett et al., Cairns J et al. and Adam J Rubin et al.
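
The two input formats described above (the tab-delimited Age_parents file and the comma-separated HPO terms in the phenotype file) can be sketched as follows. This is an illustrative reader, not the pipeline's own code. It assumes the Age_parents file has no header row unless stated otherwise, and the sample IDs and ages shown are made up. Reading the Excel file itself is left out; only the splitting of one phenotype cell is shown.

```python
import csv
import io


def read_age_parents(handle, has_header=False):
    """Parse tab-delimited rows of sample_id, age_father, age_mother."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    ages = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        sample_id, age_father, age_mother = row[:3]
        ages[sample_id] = (float(age_father), float(age_mother))
    return ages


def split_hpo_terms(cell):
    """Split a comma-separated phenotype cell into a list of HPO term IDs."""
    return [term.strip() for term in cell.split(",") if term.strip()]


# Made-up example data, for illustration only.
example = io.StringIO("sampleA\t35\t32\nsampleB\t41.5\t38\n")
ages = read_age_parents(example)
terms = split_hpo_terms("HP:0001249, HP:0000252")
```

Stripping whitespace around each term keeps cells written as "HP:0001249, HP:0000252" and "HP:0001249,HP:0000252" equivalent.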