splitAssays error for LAML
lgeistlinger opened this issue · 7 comments
> x <- curatedTCGAData::curatedTCGAData(diseaseCode = "LAML", assays = "RNASeq2GeneNorm", dry.run = FALSE)
> x
A MultiAssayExperiment object of 1 listed
experiment with a user-defined name and respective class.
Containing an ExperimentList class object of length 1:
[1] LAML_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 173 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
> splitAssays(x)
Selecting 'Primary Solid Tumor' samples
Selecting 'Solid Tissue Normal' samples
Error in .checkBarcodes(barcodes) : barcode delimiters not consistent
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS
Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] TCGAutils_1.2.1 curatedTCGAData_1.4.0
[3] MultiAssayExperiment_1.8.1 SummarizedExperiment_1.12.0
[5] DelayedArray_0.8.0 BiocParallel_1.16.2
[7] matrixStats_0.54.0 Biobase_2.42.0
[9] GenomicRanges_1.34.0 GenomeInfoDb_1.18.1
[11] IRanges_2.16.0 S4Vectors_0.20.1
[13] BiocGenerics_0.28.0
loaded via a namespace (and not attached):
[1] httr_1.4.0 bit64_0.9-7
[3] jsonlite_1.6 AnnotationHub_2.14.2
[5] shiny_1.2.0 assertthat_0.2.0
[7] interactiveDisplayBase_1.20.0 BiocManager_1.30.4
[9] blob_1.1.1 GenomeInfoDbData_1.2.0
[11] Rsamtools_1.34.0 yaml_2.2.0
[13] progress_1.2.0 pillar_1.3.1
[15] RSQLite_2.1.1 lattice_0.20-38
[17] glue_1.3.0 digest_0.6.18
[19] promises_1.0.1 XVector_0.22.0
[21] rvest_0.3.2 htmltools_0.3.6
[23] httpuv_1.4.5 Matrix_1.2-15
[25] XML_3.98-1.16 pkgconfig_2.0.2
[27] biomaRt_2.38.0 zlibbioc_1.28.0
[29] purrr_0.2.5 xtable_1.8-3
[31] later_0.7.5 tibble_1.4.2
[33] GenomicFeatures_1.34.1 lazyeval_0.2.1
[35] magrittr_1.5 crayon_1.3.4
[37] mime_0.6 memoise_1.1.0
[39] xml2_1.2.0 tools_3.5.1
[41] prettyunits_1.0.2 hms_0.4.2
[43] stringr_1.3.1 AnnotationDbi_1.44.0
[45] bindrcpp_0.2.2 Biostrings_2.50.1
[47] compiler_3.5.1 rlang_0.3.0.1
[49] grid_3.5.1 GenomicDataCommons_1.6.0
[51] RCurl_1.95-4.11 rappdirs_0.3.1
[53] bitops_1.0-6 ExperimentHub_1.8.0
[55] DBI_1.0.0 curl_3.2
[57] R6_2.3.0 GenomicAlignments_1.18.0
[59] dplyr_0.7.8 rtracklayer_1.42.1
[61] bit_1.1-14 bindr_0.1.1
[63] readr_1.3.0 stringi_1.2.4
[65] Rcpp_1.0.0 tidyselect_0.2.5
Hi Ludwig, @lgeistlinger
Thanks for reporting this issue. I will fix the error message here.
If you do sampleTables(x)
you will see that the is only one type of sample in the data:
> sampleTables(x)
$`LAML_RNASeq2GeneNorm-20160128`
03
173
> sampleTypes
Code Definition Short.Letter.Code
1 01 Primary Solid Tumor TP
2 02 Recurrent Solid Tumor TR
3 03 Primary Blood Derived Cancer - Peripheral Blood TB
In this case, you don't need to split the assays since you only have one sample type.
Best regards,
Marcel
Perhaps a better behavior here would be to rename the ExperimentList
entries consistently as happens when there are 2+ sample types, and throw just a message about any sample types that are not present. Such a contract would allow splitAssays()
to be a standard part of a curatedTCGAData
analysis workflow, even using a more extensive default vector of sample type codes, without having to investigate which sample types are present for a given MAE, which is not that trivial to do.
If you do sampleTables(x) you will see that the is only one type of sample in the data:
It's worthwhile to note that only one sample type also occurs for other cancer types without throwing the error, e.g.:
> x <- curatedTCGAData::curatedTCGAData(diseaseCode = "ACC", assays = "RNASeq2GeneNorm", dry.run = FALSE)
> sampleTables(x)
$`ACC_RNASeq2GeneNorm-20160128`
01
79
I think the main problem comes from the fact that for LAML
(leukemia, blood cancer), tumor samples are exceptionally reported (due to biology) as TB/03 (Primary Blood Derived Cancer); as opposed to TP/01 (Primary Solid Tumor) for all other cancer types.
I think only few people are aware of what 01
or 11
stands for in the barcode (I wasn't until here).
And although I like the extended functionality of splitAssays
to be able to use a selection of sample type codes, the typical usage will likely go towards splitting between tumor samples (cases) and adjacent normal samples (controls). As such I think defaulting to 03
(tumor) and 10
(normal) here for LAML would be appropriate - which seems to be in agreement with what Levi suggests.
Also, a short look into the composition tumor (first row) vs normal (second row) for the other cancer types as obtained via splitAssays
(default sampleCodes
).
> nr.samples
ACC BLCA BRCA CESC CHOL COAD DLBC ESCA GBM HNSC KICH KIRC KIRP LGG LIHC
[1,] 79 408 1093 304 36 191 48 184 153 520 66 533 290 516 371
[2,] 0 19 112 3 9 0 0 11 0 44 25 72 32 0 50
LUAD LUSC MESO OV PAAD PCPG PRAD READ SARC SKCM STAD TGCT THCA THYM UCEC
[1,] 515 501 87 303 178 179 497 72 259 103 415 134 501 120 370
[2,] 59 51 0 0 4 3 52 0 2 1 35 0 59 2 10
UCS UVM
[1,] 57 80
[2,] 0 0
@lwaldron The sampleTables
function already investigates what samples are available in the data.
@lgeistlinger Yes, the function has convenient defaults but it does not mean that they will always apply for all cancer types, as you have shown. In this special case, LAML
has to be set to 03
which can be looked up in the provided sampleTypes
table. I can make it easier and change the default to 03
for LAML
but we have to be careful of going down the path of adjusting the code to the data.
Although not present in TCGA (I don't think), there are other blood cancers such as other lukemias (CLL), lymphomas (NHL), and even multiple myeloma. If they were present we'd be changing the code to account for those too.
Yes, after inspecting TCGAutils::sampleTypes
(another thing I wasn't aware), my expectation on splitAssays
might have been too high-level. It seems not always that straightforward to decide which samples to assign to the tumor and the normal group. In case of the above error, a pointer to sampleTypes
/ sampleTables
should then be helpful to clarify what's going on.
Excellent!
Just to follow up:
sampleCodes = NULL
will split the assays for all sample types found in the dataexclusive = TRUE
will split only assays that have all the providedsampleCodes