leekgroup/recount

POU5F1 and genes on more than one chromosome are currently not in recount

lcolladotor opened this issue · 1 comments

I'm migrating the discussion on this topic to the issues page.


Original issue with POU5F1

About a week ago https://gist.github.com/ronstewart reported that POU5F1 is missing from recount in an email to me. That lead to my reply involving the following gist https://gist.github.com/lcolladotor/374a0de6be5c202bbf216295989e534a:

I looked into the issue you reported and indeed POU5F1 is not part of
recount. It's because it's not part of
TxDb.Hsapiens.UCSC.hg38.knownGene which is the annotation we used.
However, you can still use the recount package to compute the coverage
matrix at the exon level (the gene level is just the sum of the exon
level) by modifying the code at
http://bioconductor.org/packages/release/bioc/vignettes/recount/inst/doc/recount-quickstart.html#using-anothernewer-annotation.

Here are the actual details of what I did
https://gist.github.com/lcolladotor/374a0de6be5c202bbf216295989e534a.

Actual issue

Ron replied with a comment on the gist where he is concerned that other genes might be missing in recount.

In recount::reproduce_ranges() we use GenomicFeatures::genes() which by default has the argument has the argument single.strand.genes.only set to FALSE. As seen in the documentation, this argument also drops genes that are on two different chromosomes.

single.strand.genes.only	
TRUE or FALSE. If TRUE (the default), then genes that have exons located on both strands of the same chromosome or on two different chromosomes are dropped. In that case, the genes are returned in a GRanges object. Otherwise, all genes are returned in a GRangesList object with the columns specified thru the columns argument set as top level metadata columns. (Please keep in mind that the top level metadata columns of a GRangesList object are not displayed by the show method.)

This explains why POU5F1 is missing as shown below because it is present in chr6 (a canonical chr) and a few alternative chromosomes.

> library('GenomicFeatures')
> library('TxDb.Hsapiens.UCSC.hg38.knownGene')
> g <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene, single.strand.genes.only = FALSE)
> g['5460']
GRangesList object of length 1:
$5460 
GRanges object with 7 ranges and 0 metadata columns:
                 seqnames               ranges strand
                    <Rle>            <IRanges>  <Rle>
  [1]                chr6 [31164337, 31180731]      -
  [2] chr6_GL000251v2_alt [ 2646761,  2653135]      -
  [3] chr6_GL000252v2_alt [ 2423653,  2430027]      -
  [4] chr6_GL000253v2_alt [ 2474838,  2481213]      -
  [5] chr6_GL000254v2_alt [ 2508452,  2514827]      -
  [6] chr6_GL000255v2_alt [ 2422370,  2428743]      -
  [7] chr6_GL000256v2_alt [ 2467753,  2474125]      -

-------
seqinfo: 455 sequences (1 circular) from hg38 genome
> 
> 
> 
> g2 <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene, single.strand.genes.only = TRUE)
> g2['5460']
Error: subscript contains invalid names
> 

Status

For now, we are discussing internally what we'll do. There are several options and in any case it's possible to compute the coverage matrix for genes like POU5F1 that are present in a canonical chromosome and several non-canonical ones; these are genes researches like Ron Stewart might be interested in.

Best,
Leo

This issue is now solved. We changed the annotation for the exon and gene counts to Gencode v25 (CHR regions). The changes are live now via https://jhubiostatistics.shinyapps.io/recount/ and can be accessed with recount version 1.0.8 (Bioc-release) and 1.1.14 (Bioc-devel). It'll take about 36-48 hrs for the latest binaries of recount to be available via biocLite().