rse_gene/rse_exon have too many counts by several orders of magnitude?

Question

rse_gene/rse_exon have too many counts by several orders of magnitude?

ceiabreu opened this issue 5 years ago · 1 comments

Hi all,

Not sure if this is a problem of the project I decided to work with (SRP058740), or could be more general. First, let me state that according to the SRA website, I would expect around 30-60 million counts per sample. For instance, https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR2040575 has 73M spots). For this project, I downloaded (both directly from website, and using the recount bioC package, same result) the rse_tx, rse_gene and rse_exon Rdata files, and loaded them in R.

colSums(assays(rse_gene)$counts)["SRR2040575"]/1e6
SRR2040575
13788.46
colSums(assays(rse_exon)$counts)["SRR2040575"]/1e6
SRR2040575
13788.46
colSums(assays(rse_tx)$fragments, na.rm=TRUE)["SRR2040575"]/1e6
SRR2040575
63.74166

Only the last one makes any sense. Am I missing something obvious?

Output of sessionInfo() below.

Thanks,

Cei

R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] recount_1.12.1 SummarizedExperiment_1.16.1 DelayedArray_0.12.2 BiocParallel_1.20.1 matrixStats_0.56.0
[6] Biobase_2.46.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.1 IRanges_2.20.2 S4Vectors_0.24.3
[11] BiocGenerics_0.32.0

loaded via [1] colorspace_1.4-1 [6] rstudioapi_0.11 [11] splines_3.6.2 [16] cluster_2.1.0 [21] compiler_3.6.2 [26] limma_3.42.2 [31] gtable_0.3.0 [36] rappdirs_0.3.1 [41] Biostrings_2.54.0 [46] lifecycle_0.2.0 [51] BSgenome_1.54.0 [56] RColorBrewer_1.1-2 [61] downloader_0.4 [66] RSQLite_2.2.0 [71] pkgconfig_2.0.3 [76] GenomicAlignments_1.22.1 [81] magrittr_1.5 [86] foreign_0.8-76 [91] crayon_1.3.4 [96] locfit_1.5-9.4 [101] tidyr_1.0.2 a namespace (and not attached):
qvalue_2.18.0 htmlTable_1.13.3 XVector_0.26.0 base64enc_0.1-3
bit64_0.9-7 AnnotationDbi_1.48.0 xml2_1.2.5 codetools_0.2-16
knitr_1.28 Formula_1.2-3 jsonlite_1.6.1 Rsamtools_2.2.3
dbplyr_1.4.2 png_0.1-7 rentrez_1.2.2 readr_1.3.1
httr_1.4.1 backports_1.1.5 assertthat_0.2.1 Matrix_1.2-18
acepack_1.4.1 htmltools_0.4.0 prettyunits_1.1.1 tools_3.6.2
glue_1.3.2 GenomeInfoDbData_1.2.2 reshape2_1.4.3 dplyr_0.8.5
doRNG_1.8.2 Rcpp_1.0.4 bumphunter_1.28.0 vctrs_0.2.4
rtracklayer_1.46.0 iterators_1.0.12 xfun_0.12 stringr_1.4.0
rngtools_1.5 XML_3.99-0.3 zlibbioc_1.32.0 scales_1.1.0
VariantAnnotation_1.32.0 hms_0.5.3 GEOquery_2.54.1 derfinderHelper_1.20.0
curl_4.3 memoise_1.1.0 gridExtra_2.3 ggplot2_3.3.0
biomaRt_2.42.1 rpart_4.1-15 latticeExtra_0.6-29 stringi_1.4.6
foreach_1.5.0 checkmate_2.0.0 GenomicFeatures_1.38.2 rlang_0.4.5
GenomicFiles_1.22.0 bitops_1.0-6 lattice_0.20-40 purrr_0.3.3
htmlwidgets_1.5.1 bit_1.1-15.2 tidyselect_1.0.0 plyr_1.8.6
R6_2.4.1 Hmisc_4.4-0 DBI_1.1.0 pillar_1.4.3
survival_3.1-11 RCurl_1.98-1.1 nnet_7.3-13 tibble_2.1.3
derfinder_1.20.0 BiocFileCache_1.10.2 jpeg_0.1-8.1 progress_1.2.2
grid_3.6.2 data.table_1.12.8 blob_1.2.1 digest_0.6.25
openssl_1.4.1 munsell_0.5.0 askpass_1.1

Answer 1 · 2020-04-01T01:55:44.000Z

Hi Cei,

Your question is related to https://support.bioconductor.org/p/95877/ and https://support.bioconductor.org/p/121767/. The long answer is also explained at https://f1000research.com/articles/6-1558/v1.

Basically, you are looking at sum of coverage across each exonic base, which is why it's order of magnitudes different from your typical read counts. This is due to how we actually computed these coverage counts from the coverage bigwig files instead of read alignment files that typical pipelines use. We only ever produced bigwig files since they are smaller than BAM files and made the whole project feasible (given our budget and the computing costs associated with recount2).

Best,
Leo