BETA: Binding and Expression Target Analysis

Introduction

Binding and Expression Target Analysis (BETA) is a software package that integrates ChIP-seq of transcription factors or chromatin regulators with differential gene expression data to infer direct target genes.

Note

This is just a snapshot of the BETA repository. To download the latest version of BETA, please go to http://cistrome.org/BETA/.

Citation

Wang, S., Sun, H., Ma, J., Zang, C., Wang, C., Wang, J., ... & Liu, X. S. (2013). Target analysis by integration of transcriptome and ChIP-seq data with BETA. Nature Protocols, 8(12), 2502-2515.

Run BETA on Web Server

Go to http://cistrome.org/ap to run BETA on the Cistrome web server.

Python Version

Python 2.6 or above is recommended.

Installation

$ python setup.py install

Command Line

Help

$ BETA --help

BETA --- Binding Expression Target Analysis
BETA [options]* -p <peak> -e <expression> -k <type> -b <boundary> -g <genome>

Main Arguments

-p PEAKFILE, --peakfile=PEAKFILE
 Peak file of the factor, in BED format
-e EXPREFILE, --diff_expr=EXPREFILE
 Differential expression file: limma output for microarray data, cuffdiff output for RNA-seq data
-k KIND, --kind=KIND
 The kind of expression file (required): M for microarray, R for RNA-seq
-b BOUNDARYFILE, --bound=BOUNDARYFILE
 BED-format file of conserved CTCF binding site boundaries
-g GENOME, --genome=GENOME
 Genome file (sqlite3 file) used to look up refGenes

Options

--version Show program's version number and exit
-h, --help Show this help message and exit.
--pn=PEAKNUMBER
 The number of peaks you want to consider, DEFAULT=10000
-n NAME, --name=NAME
 Name used for the result files. If not set, the peak file name is used instead
-d DISTANCE, --distance=DISTANCE
 Distance in bases. Only peaks within this distance of a gene's TSS are considered. DEFAULT=100000 (100 kb)
--df=DIFF_FDR A number between 0 and 1 used as an FDR threshold to pick the most significantly differentially expressed genes. DEFAULT = 1, that is, select all genes
--da=DIFF_AMOUNT
 A number between 0 and 1 outputs that fraction of the most significantly differentially expressed genes; a number greater than 1, for example 2000, outputs that many top genes. DEFAULT = 0.5, that is, select the top 25 percent. NOTE: To use diff_fdr, set this parameter to 1; otherwise the intersection of the two filters is used
-c CUTOFF, --cutoff=CUTOFF
 A number between 0 and 1 used as the p-value threshold, from a one-sided KS test, for selecting the target gene list (up-regulated, down-regulated, or both). DEFAULT = 0.001
--pt=PERMUTETIMES
 Number of permutations. Choose a reasonable value to get an exact FDR. Running time depends on the number of genes and the number of permutations. DEFAULT=500
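
The two modes of --da (a fraction versus an absolute count) can be illustrated with a minimal sketch. It assumes the genes are already sorted from most to least significant; the function name `select_top_genes` is hypothetical and this is not BETA's own code.

```python
def select_top_genes(ranked_genes, diff_amount):
    """Keep the top genes from a list already sorted by significance.

    Hypothetical helper, not part of BETA: a value in (0, 1] is read as a
    fraction of the list, a value above 1 as an absolute gene count.
    """
    if diff_amount <= 0:
        raise ValueError("diff_amount must be positive")
    if diff_amount <= 1:
        # Fraction mode: 1 keeps every gene.
        keep = int(len(ranked_genes) * diff_amount)
    else:
        # Count mode: slicing past the end simply returns the whole list.
        keep = int(diff_amount)
    return ranked_genes[:keep]
```

Under this reading, a value of 1 leaves the list unfiltered, which matches the advice to set --da to 1 when filtering by --df alone.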

Example

BETA -p 2723_peaks.bed -e gene_exp.diff -b hg19_CTCF_bound.bed -k R -g hg19.refseq
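
For scripted runs, the same invocation can be assembled from Python. This is a rough sketch that assumes the BETA executable is on the PATH after installation; the helper names `build_beta_command` and `run_beta` are hypothetical, and only flags listed in the help text above are used.

```python
import subprocess


def build_beta_command(peak, expression, kind, boundary, genome, name=None):
    """Assemble the argument list for a BETA run (hypothetical helper)."""
    if kind not in ("M", "R"):
        raise ValueError("kind must be 'M' (microarray) or 'R' (RNA-seq)")
    cmd = ["BETA", "-p", peak, "-e", expression, "-k", kind,
           "-b", boundary, "-g", genome]
    if name is not None:
        # -n sets the name used for the result files.
        cmd += ["-n", name]
    return cmd


def run_beta(cmd):
    """Run BETA; raises CalledProcessError on a non-zero exit status."""
    subprocess.check_call(cmd)
```

Calling `build_beta_command("2723_peaks.bed", "gene_exp.diff", "R", "hg19_CTCF_bound.bed", "hg19.refseq")` yields the argument list for the example command, which can then be passed to `run_beta`.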

Input Files Format

  • Peak : BED format

    chroms start end name score [strand]

    If your BED file does not have the name and score columns, add placeholder values for them.

  • Expression by Microarray : Result of Limma

    ID Refseq logFC AveExpre Tscore Pvalue adj.P.Value B

  • Expression by RNAseq : Result of Cufflinks

    Test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 Log2(foldchange) test_stat p_value q_value significant

  • CTCF conserved boundary : BED format

    chroms start end name score [strand]

    The conserved CTCF binding sites across all cell lines.

  • Genome reference : Downloaded from UCSC

    refseqID chroms strand txstart txend genesymbol.

    BETA uses this file as the reference for gene information.
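
A peak file with only the chrom, start, and end columns can be padded with placeholder name and score columns before it is passed to BETA. The following is a minimal sketch, assuming a tab-delimited BED file; the helper names, the `peak_<n>` naming scheme, and the score of 0 are arbitrary illustrative choices, not values BETA requires.

```python
def pad_bed_fields(fields, index):
    """Return BED fields with placeholder name and score when missing.

    Hypothetical helper: 'peak_<index>' and a score of '0' are arbitrary
    placeholders chosen for illustration.
    """
    fields = list(fields)
    if len(fields) < 3:
        raise ValueError("a BED record needs at least chrom, start, end")
    if len(fields) == 3:
        fields.append("peak_%d" % index)
    if len(fields) == 4:
        fields.append("0")
    return fields


def pad_bed_file(src_path, dst_path):
    """Copy a tab-delimited BED file, filling in missing name/score columns."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        index = 0
        for line in src:
            # Pass blank lines and header lines through unchanged.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                dst.write(line)
                continue
            index += 1
            fields = pad_bed_fields(line.rstrip("\n").split("\t"), index)
            dst.write("\t".join(fields) + "\n")
```

Records that already carry a name and score are left unchanged, so the function is safe to run on a complete file.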

Output Files

  • score.pdf : A CDF figure for testing the TF's function: up- or down-regulation.
  • score.r : The R script that draws the score.pdf figure.
  • uptarget.txt : The up-regulated target genes, in 4 columns: Refseq, Gene Symbol, Rank Product, FDR.
  • downtarget.txt : The down-regulated target genes, in the same format as uptarget.txt.

NOTE: Whether an up or down target file is produced depends on the test result shown in the PDF file. A target file is not produced unless the test passes the threshold set via -c/--cutoff.
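
The target files can be loaded for downstream work with a short script. This sketch assumes the files are tab-delimited, have no header row, and hold the four columns in the order listed above; none of these layout details is confirmed here, and the helper names are hypothetical. A missing file is treated as "no targets", since a target file is only written when the test passes the cutoff.

```python
import os


def parse_target_line(line):
    """Split one target-file row into its four documented columns."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 4:
        raise ValueError("expected 4 columns, got %d" % len(parts))
    refseq, symbol, rank_product, fdr = parts
    return {
        "refseq": refseq,
        "symbol": symbol,
        "rank_product": float(rank_product),
        "fdr": float(fdr),
    }


def read_targets(path):
    """Return parsed rows, or an empty list when the file was not produced."""
    if not os.path.exists(path):
        return []
    with open(path) as handle:
        return [parse_target_line(line) for line in handle if line.strip()]
```

If the files turn out to carry a header row, skip the first line before parsing.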