waldronlab/TCGAutils

splitAssays error for LAML

lgeistlinger opened this issue · 7 comments

> x <- curatedTCGAData::curatedTCGAData(diseaseCode = "LAML", assays = "RNASeq2GeneNorm", dry.run = FALSE)
> x
A MultiAssayExperiment object of 1 listed
 experiment with a user-defined name and respective class. 
 Containing an ExperimentList class object of length 1: 
 [1] LAML_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 173 columns 
Features: 
 experiments() - obtain the ExperimentList instance 
 colData() - the primary/phenotype DataFrame 
 sampleMap() - the sample availability DataFrame 
 `$`, `[`, `[[` - extract colData columns, subset, or experiment 
 *Format() - convert into a long or wide DataFrame 
 assays() - convert ExperimentList to a SimpleList of matrices

> splitAssays(x)
Selecting 'Primary Solid Tumor' samples
Selecting 'Solid Tissue Normal' samples
Error in .checkBarcodes(barcodes) : barcode delimiters not consistent
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] TCGAutils_1.2.1             curatedTCGAData_1.4.0      
 [3] MultiAssayExperiment_1.8.1  SummarizedExperiment_1.12.0
 [5] DelayedArray_0.8.0          BiocParallel_1.16.2        
 [7] matrixStats_0.54.0          Biobase_2.42.0             
 [9] GenomicRanges_1.34.0        GenomeInfoDb_1.18.1        
[11] IRanges_2.16.0              S4Vectors_0.20.1           
[13] BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] httr_1.4.0                    bit64_0.9-7                  
 [3] jsonlite_1.6                  AnnotationHub_2.14.2         
 [5] shiny_1.2.0                   assertthat_0.2.0             
 [7] interactiveDisplayBase_1.20.0 BiocManager_1.30.4           
 [9] blob_1.1.1                    GenomeInfoDbData_1.2.0       
[11] Rsamtools_1.34.0              yaml_2.2.0                   
[13] progress_1.2.0                pillar_1.3.1                 
[15] RSQLite_2.1.1                 lattice_0.20-38              
[17] glue_1.3.0                    digest_0.6.18                
[19] promises_1.0.1                XVector_0.22.0               
[21] rvest_0.3.2                   htmltools_0.3.6              
[23] httpuv_1.4.5                  Matrix_1.2-15                
[25] XML_3.98-1.16                 pkgconfig_2.0.2              
[27] biomaRt_2.38.0                zlibbioc_1.28.0              
[29] purrr_0.2.5                   xtable_1.8-3                 
[31] later_0.7.5                   tibble_1.4.2                 
[33] GenomicFeatures_1.34.1        lazyeval_0.2.1               
[35] magrittr_1.5                  crayon_1.3.4                 
[37] mime_0.6                      memoise_1.1.0                
[39] xml2_1.2.0                    tools_3.5.1                  
[41] prettyunits_1.0.2             hms_0.4.2                    
[43] stringr_1.3.1                 AnnotationDbi_1.44.0         
[45] bindrcpp_0.2.2                Biostrings_2.50.1            
[47] compiler_3.5.1                rlang_0.3.0.1                
[49] grid_3.5.1                    GenomicDataCommons_1.6.0     
[51] RCurl_1.95-4.11               rappdirs_0.3.1               
[53] bitops_1.0-6                  ExperimentHub_1.8.0          
[55] DBI_1.0.0                     curl_3.2                     
[57] R6_2.3.0                      GenomicAlignments_1.18.0     
[59] dplyr_0.7.8                   rtracklayer_1.42.1           
[61] bit_1.1-14                    bindr_0.1.1                  
[63] readr_1.3.0                   stringi_1.2.4                
[65] Rcpp_1.0.0                    tidyselect_0.2.5             

Hi Ludwig, @lgeistlinger
Thanks for reporting this issue. I will fix the error message here.

If you do sampleTables(x) you will see that the is only one type of sample in the data:

> sampleTables(x)
$`LAML_RNASeq2GeneNorm-20160128`

 03 
173 
> sampleTypes
   Code                                        Definition Short.Letter.Code
1    01                               Primary Solid Tumor                TP
2    02                             Recurrent Solid Tumor                TR
3    03   Primary Blood Derived Cancer - Peripheral Blood                TB

In this case, you don't need to split the assays since you only have one sample type.

Best regards,
Marcel

Perhaps a better behavior here would be to rename the ExperimentList entries consistently as happens when there are 2+ sample types, and throw just a message about any sample types that are not present. Such a contract would allow splitAssays() to be a standard part of a curatedTCGAData analysis workflow, even using a more extensive default vector of sample type codes, without having to investigate which sample types are present for a given MAE, which is not that trivial to do.

If you do sampleTables(x) you will see that the is only one type of sample in the data:

It's worthwhile to note that only one sample type also occurs for other cancer types without throwing the error, e.g.:

> x <- curatedTCGAData::curatedTCGAData(diseaseCode = "ACC", assays = "RNASeq2GeneNorm", dry.run = FALSE)
> sampleTables(x)
$`ACC_RNASeq2GeneNorm-20160128`

01 
79 

I think the main problem comes from the fact that for LAML (leukemia, blood cancer), tumor samples are exceptionally reported (due to biology) as TB/03 (Primary Blood Derived Cancer); as opposed to TP/01 (Primary Solid Tumor) for all other cancer types.

I think only few people are aware of what 01 or 11 stands for in the barcode (I wasn't until here).
And although I like the extended functionality of splitAssays to be able to use a selection of sample type codes, the typical usage will likely go towards splitting between tumor samples (cases) and adjacent normal samples (controls). As such I think defaulting to 03 (tumor) and 10 (normal) here for LAML would be appropriate - which seems to be in agreement with what Levi suggests.

Also, a short look into the composition tumor (first row) vs normal (second row) for the other cancer types as obtained via splitAssays (default sampleCodes).

> nr.samples
     ACC BLCA BRCA CESC CHOL COAD DLBC ESCA GBM HNSC KICH KIRC KIRP LGG LIHC
[1,]  79  408 1093  304   36  191   48  184 153  520   66  533  290 516  371
[2,]   0   19  112    3    9    0    0   11   0   44   25   72   32   0   50
     LUAD LUSC MESO  OV PAAD PCPG PRAD READ SARC SKCM STAD TGCT THCA THYM UCEC
[1,]  515  501   87 303  178  179  497   72  259  103  415  134  501  120  370
[2,]   59   51    0   0    4    3   52    0    2    1   35    0   59    2   10
     UCS UVM
[1,]  57  80
[2,]   0   0

@lwaldron The sampleTables function already investigates what samples are available in the data.

@lgeistlinger Yes, the function has convenient defaults but it does not mean that they will always apply for all cancer types, as you have shown. In this special case, LAML has to be set to 03 which can be looked up in the provided sampleTypes table. I can make it easier and change the default to 03 for LAML but we have to be careful of going down the path of adjusting the code to the data.

Although not present in TCGA (I don't think), there are other blood cancers such as other lukemias (CLL), lymphomas (NHL), and even multiple myeloma. If they were present we'd be changing the code to account for those too.

Yes, after inspecting TCGAutils::sampleTypes (another thing I wasn't aware), my expectation on splitAssays might have been too high-level. It seems not always that straightforward to decide which samples to assign to the tumor and the normal group. In case of the above error, a pointer to sampleTypes / sampleTables should then be helpful to clarify what's going on.

Excellent!

Just to follow up:

  • sampleCodes = NULL will split the assays for all sample types found in the data
  • exclusive = TRUE will split only assays that have all the provided sampleCodes