Primary language: R. License: GNU General Public License v2.0 (GPL-2.0).

HMZDelFinder

CNV calling algorithm for detection of rare, homozygous and hemizygous deletions from whole exome sequencing data

Prerequisites

  • R (version >= 3.0.1)

The following R packages are required to run HMZDelFinder:

  • RCurl (version >= 1.95.4.7)
  • gdata (version >= 2.17.0)
  • data.table (version >= 1.9.6)
  • DNAcopy (version >= 1.36.0)
  • GenomicRanges (version >= 1.14.4)
  • parallel (version >= 3.0.1)
  • Hmisc (version >= 3.16.0)
  • matrixStats (version >= 0.50.2)
  • Rsubread (version >= 1.20.3)

To install missing packages, run the code from the appropriate sections ('install missing packages from ...') in example/example_run.R.
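Before installing anything, it can help to see which of the packages listed above are absent or older than the stated minimum. The following is a minimal sketch using base R only; the helper name findUnmetPackages is hypothetical and is not part of HMZDelFinder.

```r
# Minimum versions as listed above. 'parallel' ships with base R.
required <- c(
  RCurl = "1.95.4.7", gdata = "2.17.0", data.table = "1.9.6",
  DNAcopy = "1.36.0", GenomicRanges = "1.14.4", parallel = "3.0.1",
  Hmisc = "3.16.0", matrixStats = "0.50.2", Rsubread = "1.20.3"
)

# Hypothetical helper: returns the names of packages that are either
# not installed or installed in a version below the required minimum.
findUnmetPackages <- function(required) {
  unmet <- vapply(names(required), function(pkg) {
    if (!requireNamespace(pkg, quietly = TRUE)) return(TRUE)
    packageVersion(pkg) < package_version(required[[pkg]])
  }, logical(1))
  names(required)[unmet]
}

unmet <- findUnmetPackages(required)
if (length(unmet) > 0) {
  message("Missing or outdated packages: ", paste(unmet, collapse = ", "))
} else {
  message("All required packages are available.")
}
```

The check only reports; the actual installation commands are those given in example/example_run.R.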

Running HMZDelFinder

  • A working example that runs HMZDelFinder on 50 samples from the 1000 Genomes Project is available in example/example_run.R
  • The code was tested on Linux and may not work properly on other platforms.

Format of input files

BED file

Tab-delimited file with no header and four columns:

  • Chromosome
  • Start
  • Stop
  • Gene symbol
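As an illustration of this layout, the sketch below writes a tiny BED file and reads it back with base R. The coordinates and gene symbols are made up, and the column names (Chr, Start, Stop, Gene) are assigned here for convenience only, since the file itself has no header.

```r
# Write a tiny illustrative BED file (made-up coordinates and gene symbols).
bedFile <- tempfile(fileext = ".bed")
writeLines(c(
  "1\t1000\t1200\tGENE_A",
  "1\t5000\t5300\tGENE_A",
  "2\t800\t950\tGENE_B"
), bedFile)

# The file has no header line, so column names are assigned explicitly.
# Chromosome is read as character so that names like "X" are preserved.
bed <- read.table(bedFile, sep = "\t", header = FALSE,
                  col.names = c("Chr", "Start", "Stop", "Gene"),
                  colClasses = c("character", "integer", "integer", "character"),
                  stringsAsFactors = FALSE)

# Basic sanity checks on the four-column layout.
stopifnot(ncol(bed) == 4, all(bed$Start < bed$Stop))
```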

RPKM files

Tab-delimited file with a header and two columns:

  • count // the number of reads overlapping with each capture target
  • RPKM // the RPKM value for each capture target

IMPORTANT: The number of rows and the order of capture targets must match the number of rows and the order of targets in the BED file.

To generate RPKM files from BAM files, see the comments in example/example_run.R.
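To make the file layout concrete, here is a minimal sketch that computes RPKM from per-target read counts with the standard definition (reads per kilobase of target per million mapped reads) and writes the two-column file. It rests on assumptions that may differ from example/example_run.R: target length is taken as Stop - Start, the total read count is taken as the sum of the on-target counts, and all values and file names are made up.

```r
# Made-up capture targets and read counts, listed in the same order.
bed <- data.frame(Chr = c("1", "1", "2"),
                  Start = c(1000L, 5000L, 800L),
                  Stop = c(1200L, 5300L, 950L),
                  Gene = c("GENE_A", "GENE_A", "GENE_B"),
                  stringsAsFactors = FALSE)
counts <- c(40, 0, 60)

# Standard RPKM: reads per kilobase of target per million mapped reads.
# Assumptions: target length = Stop - Start; total = sum of on-target counts.
targetLength <- bed$Stop - bed$Start
totalReads <- sum(counts)
rpkm <- 1e9 * counts / (targetLength * totalReads)

# Two columns with a header, one row per BED target, in BED order.
rpkmFile <- tempfile(fileext = ".txt")
write.table(data.frame(count = counts, RPKM = rpkm), rpkmFile,
            sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back and confirm the row count matches the BED file.
rpkmDt <- read.table(rpkmFile, sep = "\t", header = TRUE)
stopifnot(nrow(rpkmDt) == nrow(bed))
```

Because the RPKM file carries no coordinates, the row-count check is the only consistency test available after the fact; keeping the order identical to the BED file is up to the code that produces the counts.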

VCF files

VCF files are required for AOH (absence of heterozygosity) analysis and for further filtering of identified deletion calls. All files are assumed to be single-sample VCFs compressed with bz2. The files should follow the standard VCF format; the following columns are the most important:

  • CHROM // Chromosome
  • POS // Position
  • FILTER // Only variants with "PASS" in the FILTER column are used for AOH analysis.
  • FORMAT // 9th VCF column, defining the fields of the last column
  • SAMPLE // 10th VCF column, containing the genotype data, the total number of reads ('DP'), and the number of variant reads (e.g. 'VR')

NOTE: To calculate the B-allele frequency (needed for AOH analysis), the last column of the VCF must report both the total number of reads and the number of variant reads for every variant. In addition, all multiallelic sites should be filtered out. Such VCFs can be generated by, e.g., the Atlas2 variant caller.
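The B-allele frequency mentioned in the note is the number of variant reads divided by the total number of reads. The sketch below shows one way to derive it from the FORMAT and SAMPLE columns of PASS records. It is an illustration only, not HMZDelFinder's own code: the helper computeBAF is hypothetical, the records and the non-PASS filter value are made up, and the variant-read tag ('VR' here) depends on the variant caller.

```r
# Hypothetical helper: B-allele frequency for one VCF record, computed from
# its FORMAT string and sample column. Returns NA if a tag is missing,
# non-numeric, or the depth is zero.
computeBAF <- function(format, sample, variantTag = "VR", depthTag = "DP") {
  keys <- strsplit(format, ":", fixed = TRUE)[[1]]
  values <- strsplit(sample, ":", fixed = TRUE)[[1]]
  fields <- setNames(values[seq_along(keys)], keys)
  variantReads <- suppressWarnings(as.numeric(fields[variantTag]))
  depth <- suppressWarnings(as.numeric(fields[depthTag]))
  if (is.na(variantReads) || is.na(depth) || depth == 0) return(NA_real_)
  unname(variantReads / depth)
}

# Made-up single-sample records using the columns described above.
records <- data.frame(
  CHROM = c("1", "1", "2"),
  POS = c(1050L, 5100L, 900L),
  FILTER = c("PASS", "low_snpqual", "PASS"),
  FORMAT = "GT:VR:DP",
  SAMPLE = c("0/1:12:30", "0/1:3:9", "1/1:20:20"),
  stringsAsFactors = FALSE
)

# Only PASS variants are used; BAF = variant reads / total reads.
passing <- records[records$FILTER == "PASS", ]
passing$BAF <- mapply(computeBAF, passing$FORMAT, passing$SAMPLE,
                      USE.NAMES = FALSE)
```

If the VCFs use a different tag for variant reads, pass it through the variantTag argument.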

Format of output files

The object returned by runHMZDelFinder(...) contains the following items:

  • filteredCalls // the list of calls after AOH and deletion size filtering
  • allCalls // the list of calls before AOH and deletion size filtering
  • bedOrdered // the data.table containing ordered coordinates of coverage targets
  • rpkmDtOrdered // the data.table containing RPKM data for all samples; row names correspond to sample identifiers and columns to coverage targets.

Format of filteredCalls/allCalls

Both objects are data.frames with the following columns:

  • Chr // Chromosome
  • Start // Start position of deletion call
  • Stop // End position of deletion call
  • Genes // Comma-separated list of genes encompassed by the deletion
  • Start_idx // Index of the first target
  • Mark_num // Number of targets that indicate the deletion
  • Exon_num // Number of exons affected (total number of targets encompassed by the deletion)
  • FID // Sample identifier
  • Length // Length of deletion
  • BAB // Internal sample name (used only for BHCMG samples)
  • project // Project name (used only for BHCMG samples)
  • PoorSample // TRUE if the number of calls in the sample exceeds the 98th percentile
  • posKey // (chr+start+stop)
  • key // (sampleId+chr+start+stop)
  • inAOH_1000 // TRUE if the deletion overlaps any AOH region larger than 1000 bp
  • ZScore // z-score
  • OverlapCnt // number of overlapping calls in other samples
  • PerSampleNr // number of calls in this sample