seqinfo for GRanges elements?
Note the unspecified genome and missing seqlengths at the end of the output:
> gbmMAE
A MultiAssayExperiment object of 4 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 4:
[1] GBM_CNASNP-20160128: RaggedExperiment with 602338 rows and 1104 columns
[2] GBM_mRNAArray_huex-20160128: SummarizedExperiment with 18632 rows and 431 columns
[3] GBM_mRNAArray_TX_g4502a-20160128: SummarizedExperiment with 17814 rows and 502 columns
[4] GBM_mRNAArray_TX_ht_hg_u133a-20160128: SummarizedExperiment with 12042 rows and 528 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
> rowRanges(experiments(gbmMAE)[[1]])
GRanges object with 602338 ranges and 0 metadata columns:
           seqnames              ranges strand
              <Rle>           <IRanges>  <Rle>
       [1]        1      61735-25418699      *
       [2]        1   25423401-25424322      *
       [3]        1   25424889-25583341      *
       [4]        1   25593128-25662212      *
       [5]        1   25663310-72750353      *
       ...      ...                 ...    ...
  [602334]       23 148888090-148888542      *
  [602335]       23 148888898-152528086      *
  [602336]       23 152528150-152531276      *
  [602337]       23 152532889-155182354      *
  [602338]       24    2650438-59018259      *
-------
seqinfo: 24 sequences from an unspecified genome; no seqlengths
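A minimal sketch of how the missing genome label could be filled in on ranges like these. It rests on assumptions that are not confirmed in this thread: that the build is hg19, and that seqnames 23 and 24 stand for chromosomes X and Y. The toy `gr` object below only mimics three of the ranges shown above; it is not the real data.

```r
library(GenomicRanges)

# Toy stand-in for rowRanges(experiments(gbmMAE)[[1]]); three ranges copied
# from the printout above.
gr <- GRanges(
    seqnames = c("1", "23", "24"),
    ranges = IRanges(
        start = c(61735, 148888090, 2650438),
        end   = c(25418699, 148888542, 59018259)
    )
)

# Assumption: 23 and 24 encode chromosomes X and Y.
new_levels <- seqlevels(gr)
new_levels[new_levels == "23"] <- "X"
new_levels[new_levels == "24"] <- "Y"

# Rename to UCSC-style names ("chr1", "chrX", "chrY").
seqlevels(gr) <- paste0("chr", new_levels)

# Assumption: the data are on hg19.
genome(gr) <- "hg19"

seqinfo(gr)
```

This sets only the genome label. Seqlengths would still have to come from a reference for the chosen build, for example a BSgenome package.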
Hi Vince, @vjcitn
Thanks for pointing this out. I will see if I can modify the code in
WaldronLab/MultiAssayExperiment-TCGA to update datasets with genome info.
I remember doing this in the past but I'm not sure if it worked for all datasets.
Essentially, it tries to provide build information from either the file names or a column in the data.
Regards,
Marcel
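A rough sketch of the file-name half of that idea, using base R only. The helper name `guessBuild` and the example file names are hypothetical; this is not the code in WaldronLab/MultiAssayExperiment-TCGA, and it only recognises UCSC-style labels (hg18, hg19, hg38).

```r
# Hypothetical helper: guess a UCSC build label from a single file name.
# Returns NA when no hg18/hg19/hg38 label is found.
guessBuild <- function(filename) {
    stopifnot(is.character(filename), length(filename) == 1L)
    hit <- regmatches(
        filename,
        regexpr("hg(18|19|38)", filename, ignore.case = TRUE)
    )
    if (length(hit) == 0L) NA_character_ else tolower(hit)
}

guessBuild("example_cohort.snp_segmented_HG19.seg.txt")
guessBuild("example_cohort.clinical.txt")
```

A file name without a recognisable label yields `NA`, which leaves room for a fallback such as checking a build column in the data.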
It is increasingly difficult to find documentation on the Broad Firehose Pipeline, and I find contradictory information. Even this FAQ seems to indicate uncertainty:
Q: What reference genome build are you using?
A: We match the reference genome used in our analyses to the reference used to generate the data as appropriate. Our understanding is that TCGA standards stipulate that OV, COAD/READ, and LAML data are hg18, and all else is hg19. caveat: SNP6 copy number data is available in both hg18 and hg19 for all tumor cohorts, so we use hg19 for copy number analyses in all cases.
The FAQ entry at https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-EndOfTCGAQIunderstandthatTCGAdatahasmigratedtotheGDCbutwhydoIseediscrepanciesbetweenGDCandFireBrowse states (incorrectly, I believe) that the GDAC Firehose & FireBrowse portals ONLY serve HG19 data. Note that we are using Firehose legacy data, not data obtained through the GDC.
Q: I understand that TCGA data has migrated to the GDC, but why do I see discrepancies between GDC and FireBrowse?
A: Note that the GDC serves both HG38 and HG19 data. The HG19 data are considered “legacy” and represent the original calls as made by each of the sequencing centers in TCGA; they ARE NOT the default data served by the GDC, and instead are served from the (slightly hidden) legacy archive section of the GDC portal. By default the public GDC interface serves HG38 data; these are newly generated at the GDC itself, with the intent to smooth over differences across the entire set of TCGA samples by “harmonizing” them with common variant callers and reference data. It is important to understand that these HG38 data are not the original HG19 legacy data that is discussed in most of the current TCGA publications. Lastly, note that the public GDAC Firehose & FireBrowse portals ONLY serve HG19 data; we’ve been reluctant to release HG38 data (and analyses of them) to the general public until they have gone through more in-depth QC/vetting. This QC has not been fully completed yet, but is an active area of investigation (with an analysis working group, or AWG) within the nascent GDAN. We are aiming to have a first release of HG38 GDAC pipelines in FireBrowse by Q1 of 2018, after the QC group completes its assessment to the satisfaction of the NCI.
I also drafted a function that adds ranges, using hg19, to those SummarizedExperiments in curatedTCGAData whose rownames are gene symbols. It's pretty specific to curatedTCGAData and has a hack (as with the other gist I recently posted) to get around not being able to concatenate to a MultiAssayExperiment with the desired name. It would require some testing and cleaning before going into the package, but let me know if it seems useful:
https://gist.github.com/lwaldron/63b403803e91b3a3ce72592fa6e85f79
> symbolsToRanges(miniACC)
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
A MultiAssayExperiment object of 7 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 7:
[1] Mutations: matrix with 97 rows and 90 columns
[2] miRNASeqGene: SummarizedExperiment with 471 rows and 80 columns
[3] RNASeq2GeneNorm_ranged: RangedSummarizedExperiment with 195 rows and 79 columns
[4] RNASeq2GeneNorm_unranged: SummarizedExperiment with 3 rows and 79 columns
[5] gistict_ranged: RangedSummarizedExperiment with 195 rows and 90 columns
[6] gistict_unranged: SummarizedExperiment with 3 rows and 90 columns
[7] RPPAArray_ranged: RangedSummarizedExperiment with 33 rows and 46 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
Skip the gists now, and just try the conveniencefuns branch. The functions are documented there, and the branch has an additional "all-in-one" simplifyTCGA function (demo in issue #18).