/AlleleHMM

A data-driven method to identify allele-specific differences in distributed functional genomic marks

Primary LanguagePythonBSD 2-Clause "Simplified" LicenseBSD-2-Clause

AlleleHMM

The key goal of AlleleHMM is to identify allele-specific blocks of signal in distributed functional genomic data assuming that contiguous genomic regions share correlated allele-specific events. We developed a hidden Markov model (HMM) that represents allelic bias in a distributed genomic mark using three hidden states: symmetric (S) distribution of the mark from both parental alleles (which shows no allelic bias), and maternally- (M) or paternally-biased (P) regions. AlleleHMM takes read counts corresponding to each allele as input. AlleleHMM uses these allele-specific read counts to set the parameters of the HMM using Baum-Welch expectation maximization. The Viterbi algorithm is then used to identify the most likely hidden states through the data, resulting in a series of candidate blocks of signal with allelic bias. We last calculated the coverage of allele-specific reads count in each predicted AlleleHMM block and performed a binomial test to examine if the block is significantly biased toward maternal or paternal transcription. The last binomial test was performed to eliminate the false positives resulted from multiple counts of a single read that mapped to multiple nearby SNPs.

  • AlleleHMM.py - identifies candidate allele-specific AlleleHMM blocks
  • BinomialTest.bsh - calculates the coverage of allele-specific reads count in each predicted AlleleHMM block and performed a binomial test to examine if the block is significantly biased toward maternal or paternal transcription

Install

git clone https://github.com/Danko-Lab/AlleleHMM.git

AlleleHMM.py

Usage

python AlleleHMM.py [options]

options:

To get help:
-h,                    Show this brief help menu.

Required options:
For non-strand-specific data such as ChIP-seq:
-i, --input_hmm=PATH   Path to the non-strnad-specific, allele-specific read counts file (AlleleHMM_input.txt)

For strand-specific data such as PRO-seq:
-p, --input_plus_hmm=PATH    Path to the plus-strand allele-specific read counts file (AlleleHMM_input_plus.txt)
-m, --input_minus_hmm=PATH   Path to the minus-strand allele-specific read counts file (AlleleHMM_input_minus.txt)

Optional operations:
-o, --output_prefix=STR      prefix for the output file. default=AlleleHMM_output
-t, --tao=FLOAT   AlleleHMM identify allele-specific blocks using 9 values of t (1E-01, 1E-02, ...,1E-09) by default.
                  User can assign a specific tao for the calculation.
  • For strand-specific data such as PRO-seq, please prepare two files.
    • AlleleHMM_input_plus.txt: allele-specific read counts file generated from plus strand
    • AlleleHMM_input_minus.txt: allele-specific read counts file generated from minus strand
python AlleleHMM.py -p AlleleHMM_input_plus.txt -m AlleleHMM_input_minus.txt -t 1e-05
  • For non-strand-specific data such as ChIP-seq, please prepare one file AlleleHMM_input.txt.
python AlleleHMM.py -i AlleleHMM_input.txt -t 1e-05

Input files

AlleleHMM takes the allele-specific read counts file in the following formats, please note:

  • Please use tab delimited text file

  • Must have header at the first line and only the first line.

  • SNP position (snppos) must be sorted according to genomic location.

  • Please see examples in the input_file_examples folder.

    chrm    snppos  mat_allele_count        pat_allele_count 
    chr1    565006  0       17      
    chr1    565286  46      0 
    chr1    565406  37      0 
    chr1    565419  31      0
    chr1    565591  27      0
    chr1    566573  0       2
    chr1    568214  0       6 
    chr1    569094  93      0 
    chr1    569933  0       2 
    

Output files

  • Please see examples in the output_file_exmaples folder.
  • AlleleHMM_output_[STRAND]_regions_t[TAO].bed: candidate blocks of signal with allelic bias in bed file format.
    • Col1: Chromosome name, sorted by dictionary-order
    • Col4: hidden states
    • Col6: + for plus strand, - for minus strand, space for non-strand-specific
    chr1    565005  565006  P       111     +
    chr1    565285  565591  M       111     +
    chr1    566572  568214  P       111     +
    chr1    569093  569094  M       111     +
    chr1    569932  16971948        S       111     +
    
  • AlleleHMM_output_t=[TAO]_parameters.txt: Optimized transition probability and expected maternal reads fraction of three states (M,P,S) using AlleleHMM.
    • t_ij: transition probability from state i to state j
    • p_i: expected maternal reads fraction of state i.v
    T=[[t_mm t_ms t_mp]
      [t_sm t_ss t_sp]
      [t_pm t_ps t_pp]]
    P= [p_m, p_s, p_p]
    

BinomialTest.bsh

Dependencies:

  • bedtools
  • BinomialTestForBed.py (Provided)
  • FalsePosForBed.py (Provided)

Usage

Perform binomial tests in the genomic regions specified in BinomTest_BED file (-b)
using the mapping location of allele-specific reads in file MAT_READ_BED (-m) and PAT_READ_BED (-p)
Reads need to map to diploid genomes, and then liftOver to the reference genome

Requirements in current working directory:
bedtools, BinomialTestForBed.py, FalsePosForBed.py

bash BinomialTest.bsh [options] -b BinomTest_BED -m MAT_READ_BED -p PAT_READ_BED

options:
To get help:
-h, --help             Show this brief help menu.

Required options:
-b,--BinomTest_BED=PATH    Path to a bed file containing regions to perform binomial tests
-m,--MAT_READ_BED=PATH     Path to a bed file containing the mapping location of maternal specific reads (liftOver to reference genome)
-p,--PAT_READ_BED=PATH     Path to a bed file containing the mapping location of paternal specific reads (liftOver to reference genome)

Optional operations:
-i,--IDENTICAL_READ_BED=PATH   Path to a bed file containing the mapping location of reads that cannot tell which allele it mapps to
-fs,--FDR_SIMS=INT             Number of simulation for FDR test [default=20]
-fc,--FDR_CUTOFF=FLOAT         FDR cut off value [default=0.1]
-ns                            Non-strand-specific analysis. Report hits in MAT_READ_BED, PAT_READ_BED, and IDENTICAL_READ_BED that overlap BinomTest_BED regardless of the strand. Strandedness is forced by default. [default=off]

Examples:

  • For strand-specific data such as PRO-seq
    bash BinomialTest.bsh -b AlleleHMM_output_plus_regions_t1E-05.bed -m MAT_READ_BED -p PAT_READ_BED -i IDENTICAL_READ_BED
    bash BinomialTest.bsh -b AlleleHMM_output_minus_regions_t1E-05.bed -m MAT_READ_BED -p PAT_READ_BED -i IDENTICAL_READ_BED
    
  • For non-strand-specific data such as ChIP-seq
    bash BinomialTest.bsh -ns -b AlleleHMM_output_both_regions_t1E-05.bed -m MAT_READ_BED -p PAT_READ_BED -i IDENTICAL_READ_BED
    

Input files

  • BinomTest_BED: output file from AlleleHMM.py AlleleHMM_output_[STRAND]_regions_t[TAO].bed. Can also be any bed file containing regions to perform binomial tests.
  • MAT_READ_BED: a bed file containing the mapping location of maternal specific reads. Reads need to map to diploid genomes, and then liftOver to the reference genome. If map to the reference genome directly, there will be mapping bias. Please see input_file_exmaples/MAT_READ_BED for an example.
  • PAT_READ_BED: a bed file containing the mapping location of paternal specific reads. Reads need to map to diploid genomes, and then liftOver to the reference genome. If map to the reference genome directly, there will be mapping bias. Please see input_file_exmaples/PAT_READ_BED for an example.

Output files

  • AlleleHMM_output_[STRAND]_regions_t[TAO]_binomtest.bed : contains the binomial test p-value of all candidate allele-specific AlleleHMM blocks.
  • AlleleHMM_output_[STRAND]_regions_t[TAO]_binomtest_FDR.txt : contain the FDR cut off value of p-value.
  • AlleleHMM_output_[STRAND]_regions_t[TAO]_binomtest_SigBlocks.bed : contains the allele-specific AlleleHMM blocks that are significantly biased. (default FDR <= 10%)