leekgroup/recount

NA's in rse_tx_brain

Closed this issue · 3 comments

Hi,
I downloaded the rse_tx_brain from https://jhubiostatistics.shinyapps.io/recount/ for both TCGA and GTEX. I only renamed them as "rse_tx_brain_TCGA.RData" and "rse_tx_brain_GTEX.RData" ( otherwise both objects would be called "rse_tx_brain.RData"

When I load both the objects, I notice a lot of NA's for a lot of transcripts - I was wondering why that is ? The NA's most likely make getRPKM() fail too.

> rm(list=ls())
> library(SummarizedExperiment)
> setwd("~/HollandLabShared/Sonali/recount_brain")
> gtex = get(load("rse_tx_brain_GTEX.Rdata"))
> tcga = get(load("rse_tx_brain_TCGA.RData"))
> 
> tcga_rpkm = getRPKM(tcga)
Error in base::colSums(x, na.rm = na.rm, dims = dims, ...) : 
  'x' must be an array of at least two dimensions
> gtex_rpkm = getRPKM(gtex)
Error in base::colSums(x, na.rm = na.rm, dims = dims, ...) : 
  'x' must be an array of at least two dimensions

I was under the impression that maybe the same transcript is NA for all samples - but that is not the case.

> length(which(is.na(assay(gtex[,1]))))
[1] 522
> length(which(is.na(assay(gtex[,151]))))
[1] 436
> 
> length(which(is.na(assay(gtex[,20]))))
[1] 892
> length(which(is.na(assay(gtex[,251]))))
[1] 436

> summary(as.numeric(assay(tcga[grep("ENST00000304053.10", rownames(tcga)), ])))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0       0       0       0       0       0     543 

Kindly advise.
Thanks,
Sonali.

Hi Sonali!

Thanks for the question!

Ahh, yeah, I see how the names can clash. We recommend saving each object inside a directory for the given project, so SRP012682/rse_tx_brain.RData for GTEx and TCGA/rse_tx_brain.RData for TCGA. I should try to resolve this along with #15 and #16! Sorry for the delay on those!

As for the NAs on the rse_tx files, Jack Fu @JMF47, the lead author behind the transcript files should hopefully be able to respond after the long weekend break.

Best,
Leo

JMF47 commented

hi @sonali-bioc, one of the factors that determines the number of NA transcripts reported is the read-length of sequencing. The read-length affects how well we can deconvolve exon+junction counts into transcript-level counts. Could you check if for instance samples gtex[,151] and gtex[,251] have the same read lengths?

I'm closing this issue since it seems inactive.