waldronlab/curatedTCGAData

status of repeated samples from donors in BRCA


14/45 packages newly attached/loaded, see sessionInfo() for details.

> c1 = curatedTCGAData("BRCA", "RNASeq2GeneNorm", dry.run=FALSE)
snapshotDate(): 2020-02-12
see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
loading from cache
see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
loading from cache
see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
loading from cache
see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
loading from cache
harmonizing input:
  removing 13161 sampleMap rows not in names(experiments)
  removing 5 colData rownames not in sampleMap 'primary'

> se1 = experiments(c1)[[1]]

> cdc1 = colData(c1)

> se1f = se1[, -which(duplicated(substr(colnames(se1),1,12)))]

> colnames(se1f) = substr(colnames(se1f), 1, 12)

> colData(se1f) = colData(c1)[colnames(se1f),]

> all.equal(se1f$patientID, colnames(se1f))
[1] TRUE
> dim(c1)
NULL
> dim(se1)
[1] 20501  1212
> dim(colData(c1))
[1] 1093 2684

So there are 1212 RNASeq contributions from 1093 individuals. I thought this was explained somewhere, but I can't put my finger on it.
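For readers following along without the data, here is a minimal base-R sketch of the same bookkeeping: count the columns per participant using the first 12 characters of the barcode, then keep one column per participant. The barcodes are made up for illustration and are not taken from BRCA. Logical indexing with `!duplicated()` is used here because negative indexing through `which()` selects nothing at all when there happen to be no duplicates.

```r
# Hypothetical barcodes for illustration only (not real BRCA samples)
barcodes <- c(
  "TCGA-AA-0001-01A-11R-A000-07",
  "TCGA-AA-0001-11A-11R-A000-07",
  "TCGA-AA-0002-01A-11R-A000-07",
  "TCGA-AA-0003-01A-11R-A000-07",
  "TCGA-AA-0003-01B-11R-A000-07"
)

# The first 12 characters of a TCGA barcode identify the participant
patient <- substr(barcodes, 1, 12)

# Number of columns contributed by each participant
cols_per_patient <- table(patient)

# Columns beyond the first one for each participant
extra <- sum(cols_per_patient - 1)

# Keep the first column seen for each participant, whatever its sample type
keep <- !duplicated(patient)
first_per_patient <- barcodes[keep]

print(cols_per_patient)
print(first_per_patient)
```

Note that keeping the first column per participant says nothing about which sample type is retained, so it may be worth checking the sample codes before dropping anything.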

Hi Vince, @vjcitn
That is correct. There are replicates in the data, most likely because it
contains several sample types, including normals:

> library(TCGAutils)
> sampleTables(c1)
$`BRCA_RNASeq2GeneNorm-20160128`

  01   06   11 
1093    7  112 
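The three counts add up to the 1212 columns reported by dim(se1). The two-digit keys are TCGA sample type codes, which sit at characters 14 and 15 of the barcode: 01 is Primary Solid Tumor, 06 is Metastatic, and 11 is Solid Tissue Normal. A rough base-R sketch of the same kind of tabulation follows; it uses made-up barcodes and is not how `sampleTables` is implemented.

```r
# Hypothetical barcodes for illustration only (not real BRCA samples)
barcodes <- c(
  "TCGA-AA-0001-01A-11R-A000-07",
  "TCGA-AA-0001-11A-11R-A000-07",
  "TCGA-AA-0002-01A-11R-A000-07",
  "TCGA-AA-0002-06A-11R-A000-07",
  "TCGA-AA-0003-01A-11R-A000-07",
  "TCGA-AA-0003-11A-11R-A000-07"
)

# Characters 14-15 of a TCGA barcode hold the sample type code
sample_code <- substr(barcodes, 14, 15)

# Labels for the three codes that appear in this thread
sample_type <- c(
  "01" = "Primary Solid Tumor",
  "06" = "Metastatic",
  "11" = "Solid Tissue Normal"
)

# Tabulate columns by sample type code
counts <- table(sample_code)

summary_df <- data.frame(
  code = names(counts),
  n    = as.vector(counts),
  type = unname(sample_type[names(counts)])
)
print(summary_df)
```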

The best way to check this is to use the replicated function, which gives you a logical vector for each entry in the colData, spanning the assay columns in question.

> replicated(c1)
$`BRCA_RNASeq2GeneNorm-20160128`
LogicalList of length 1093
[["TCGA-3C-AAAU"]] FALSE FALSE FALSE FALSE FALSE ... FALSE FALSE FALSE FALSE
[["TCGA-3C-AALI"]] FALSE FALSE FALSE FALSE FALSE ... FALSE FALSE FALSE FALSE
[["TCGA-3C-AALJ"]] FALSE FALSE FALSE FALSE FALSE ... FALSE FALSE FALSE FALSE
[["TCGA-3C-AALK"]] FALSE FALSE FALSE FALSE FALSE ... FALSE FALSE FALSE FALSE
[["TCGA-4H-AAAK"]] FALSE FALSE FALSE FALSE FALSE ... FALSE FALSE FALSE FALSE
[["TCGA-5L-AAT0"]] FALSE FALSE FALSE FALSE FALSE ... FALSE FALSE FALSE FALSE
[["TCGA-5L-AAT1"]] FALSE FALSE FALSE FALSE FALSE ... FALSE FALSE FALSE FALSE
[["TCGA-5T-A9QA"]] FALSE FALSE FALSE FALSE FALSE ... FALSE FALSE FALSE FALSE
[["TCGA-A1-A0SB"]] FALSE FALSE FALSE FALSE FALSE ... FALSE FALSE FALSE FALSE
[["TCGA-A1-A0SD"]] FALSE FALSE FALSE FALSE FALSE ... FALSE FALSE FALSE FALSE

You can then use this information to pinpoint the columns that come from the same participant:

> Filter(length, which(replicated(c1)[[1]]))
IntegerList of length 116
[["TCGA-A7-A0CE"]] 126 127
[["TCGA-A7-A0CH"]] 129 130
[["TCGA-A7-A0D9"]] 132 133
[["TCGA-A7-A0DB"]] 135 136
[["TCGA-A7-A13E"]] 138 139
[["TCGA-A7-A13F"]] 140 141
[["TCGA-A7-A13G"]] 142 143
[["TCGA-AC-A23H"]] 260 261
[["TCGA-AC-A2FB"]] 265 266
[["TCGA-AC-A2FF"]] 268 269
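The same idea can be sketched in base R without MultiAssayExperiment: group the column positions by participant and keep only the groups with more than one column. This is an illustration on made-up barcodes, under the assumption that the participant is identified by the first 12 barcode characters. It is not the `replicated` implementation.

```r
# Hypothetical barcodes for illustration only (not real BRCA samples)
barcodes <- c(
  "TCGA-AA-0001-01A-11R-A000-07",
  "TCGA-AA-0001-11A-11R-A000-07",
  "TCGA-AA-0002-01A-11R-A000-07",
  "TCGA-AA-0003-01A-11R-A000-07",
  "TCGA-AA-0003-01B-11R-A000-07"
)

# Participant identifier: first 12 characters of the barcode
patient <- substr(barcodes, 1, 12)

# Column positions grouped by participant
idx_by_patient <- split(seq_along(barcodes), patient)

# Keep only participants that contribute more than one column
replicated_idx <- Filter(function(i) length(i) > 1, idx_by_patient)

print(replicated_idx)
```

Each element of the result lists the column positions that belong to one participant, in the same spirit as the IntegerList shown above.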

If you're only interested in primary tumors, you can use TCGAprimaryTumors from TCGAutils:

> TCGAprimaryTumors(c1)
harmonizing input:
  removing 119 sampleMap rows with 'colname' not in colnames of experiments
A MultiAssayExperiment object of 1 listed
 experiment with a user-defined name and respective class.
 Containing an ExperimentList class object of length 1:
 [1] BRCA_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 1093 columns
Features:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample availability DFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
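To see what a primary-tumor filter amounts to at the level of column names, here is a small base-R sketch on a plain matrix. It assumes that primary tumors are the columns whose sample type code (barcode characters 14 and 15) is 01. The matrix, gene names, and barcodes are invented for the example, and this is not the TCGAutils code.

```r
# Hypothetical barcodes for illustration only (not real BRCA samples)
barcodes <- c(
  "TCGA-AA-0001-01A-11R-A000-07",
  "TCGA-AA-0001-11A-11R-A000-07",
  "TCGA-AA-0002-01A-11R-A000-07",
  "TCGA-AA-0002-06A-11R-A000-07"
)

# Toy expression matrix: 2 genes by 4 samples
expr <- matrix(
  seq_len(2 * length(barcodes)),
  nrow = 2,
  dimnames = list(c("geneA", "geneB"), barcodes)
)

# TRUE for columns whose sample type code is "01" (Primary Solid Tumor)
is_primary <- substr(colnames(expr), 14, 15) == "01"

# drop = FALSE keeps the result a matrix even if one column remains
expr_primary <- expr[, is_primary, drop = FALSE]

print(dim(expr_primary))
```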

Otherwise, you can use TCGAsampleSelect or splitAssays to select or split columns based on sample codes.
I hope that helps. Thanks.

Many many thanks. How could I forget about paired normal samples ... but I did.