jianhong/ChIPpeakAnno

Questions on the how genomicElementDistribution works

Opened this issue · 1 comments

hi @jianhong ,

First of all, thank you for providing such a nice tool! I am currently using the tool to analyze ATAC-seq and it works beautifully. I have some questions on the results from genomicElementDistribution.

In "genomicEelementDistribution", there are options named as "promoterRegion" and "geneDownstream". I wonder if promoterRegion is calculated based on the transcription start site (5'end of gene) of gene. Second, I wonder if either promoterRegion and geneDownstream option consider strand (+ or - strand) of gene.

so for example, let's say there is a gene name as "Tubulin" whose coordinates of start and end are 1000 and 2000 (start and end coordinates specified in GTF file), respectively, and strand is "-" and options for promoterRegion and geneDownstream are set as below

promoterRegion=c(upstream=500, downstream=500),
geneDownstream=c(upstream=0, downstream=200))

If the strandness of gene is considered, promoter region would be 1500-2500 (5'end of Tubulin is 2000 considering that the strand is "-") and downstream region would be 800-1000 (3'end of Tubulin in 1000 considering that the strand is "-").
So it would be great if you can confirm that genomicElementDistribution consider the strandness of gene or not.

Lastly, as far as I understand, If the peaks overlap multiple features (exon, intron, etc), single feature is assigned to the peak based on the pre-determined priority. It would be great if you can tell me how the features are prioritized.

In my R code, I make txdb from gencode gene annotation using the makeTxdbFromGFF as shown below

R code

setwd(dir)
library(ChIPpeakAnno)
library(GenomicFeatures)
gtf <- "gencode.v29.annotation.gtf" ## genocde annoataion
txdb <- makeTxDbFromGFF('gencode.v29.annotation.gtf')
test_peak_file <- "ATAC-seq_peaks.bed"
test_peak <- toGRanges(test_peak_file, format="BED")
out<- genomicElementDistribution(test_peak,
TxDb = txdb,
promoterRegion=c(upstream=5000, downstream=5000),
geneDownstream=c(upstream=0, downstream=2000))

Please let me know if there is any flaw in the code.
Thank you and have a nice day!

Thank you for trying ChIPpeakAnno to annotate your data.
Yes, the promoter region and downstream region will consider the strand information. The ignore.strand=TRUE is designed for the peaks to be annotated.
About the priority, it is determined by the order of element in labels parameter.
for example:

Exons=c(utr5="5' UTR", utr3="3' UTR", CDS="CDS", otherExon="Other exon")
will behavior different from
Exons=c(CDS="CDS", utr5="5' UTR", utr3="3' UTR", otherExon="Other exon")

And your codes looks good.