gagneurlab/OUTRIDER

Missing possible outliers due to high pvalues

Closed this issue · 11 comments

Hi,

we have already successfully used OUTRIDER to identify helpful insights into one of our RNA-Seq datasets, but some observation made us wondering if we are missing possible outlier candidates:

We have a base data set of 42 samples and a second data set from similar tissue but different sequencing batch consisting of 6 samples. We tried analysing them on their own and in combination.
We observed at least one gene that had a significant adjusted pvalue (<0.1) only after adding the additional samples:

non_sig_gene

sig_gene

Judging from the plots the expression pattern looks pretty similar (pvalues shown in the popup seem to be incorrect, i.e. different than the ones reported in the actual data), but there was an overall negative shift in the median expression.

Both times we used findEncodingDim with default parameters and in both cases q = 4 was recommended.

We've also seen other cases were from a visual inspection and previous in vitro experiments genes seem aberrant, but OUTRIDER's adjust pvalues are > 0.5, e.g.:

non_aberrant

Since you recommend not to combine different datasets due to OUTRIDER not being able to handle batch effects, can you think of a solution to better handle these possible outliers?

Side note: I had to slightly adapt your code, since for our dataset findEncodingDim always ran into errors when iterating through different q values. I've added some error catching to R/method-gridSearch.R:

-    ods <- OUTRIDER(ods, controlData=TRUE,q=encoding_dim, BPPARAM=BPPARAM, ...)
-    eloss <- evalAucPRLoss(ods)
+   
+    eloss <- tryCatch({
+      ods <- OUTRIDER(ods, controlData=TRUE,q=encoding_dim, BPPARAM=BPPARAM, ...)
+      evalAucPRLoss(ods)
+    }, error = function(e) {
+      warning('Catching error during encoding search')
+      NaN
+    })

Thanks for raising your concerns.
Regarding the bug in the filterExpression plot, we fixed it in the latest version on github.

Regarding merging data:
The more samples, the better the model and the outlier calls
The autoencoder should be able to correct for the different batches
Merging data is only problematic when different sequencing protocols or tissues where used

When merging data, we recommend to:
Check that the size factors are similar
Check that a similar set of genes are expressed using the plotExpressedGenes function.
Check the plotCountCorHeatmap before and after calling OUTRIDER (normalized = FALSE, normalized = TRUE). The batches should be gone after calling OUTRIDER.

Both outliers that were missed could be due to a very small sample size. In order to better call outliers, you can try to increase the sample size, for example by merging with the same tissue from GTEx.

Thanks for your suggestion on the findEncodingDim function. Can you please provide us with a minimal dataset or example that reproduces your issue. We are happy to incorporate your suggestion once we understand where and how the problem originates.

Best,
Vicente

Hi Vicente,

thanks for the tips, we'll look into integrating matching GTEx datasets.

As for the reproducible example, I can't reproduce the error with the example data in your package, so I'll first need the permission to share our data with you.

Best,
Christian

Hi Vicente,

can you give me an email address to where I can send our data?

The following code should be able to reproduce the errors we were facing:

library(OUTRIDER)

count_data <- read.table('~/countmatrix.tsv')
# Taken from: ftp://ftp.ensembl.org/pub/release-98/gtf/homo_sapiens/Homo_sapiens.GRCh38.98.gtf.gz
gtf_txdb <- makeTxDbFromGFF('~/Homo_sapiens.GRCh38.98.gtf')
snowparam <- SnowParam(workers = 6) 

ods <- OutriderDataSet(countData = count_data)
ods <- filterExpression(x = ods, gtfFile = gtf_txdb, filterGenes = T, savefpkm = T)
ods <- ods[mcols(ods)$passedFilter,]

sampleExclusionMask(ods) <- FALSE
sampleExclusionMask(ods[,c(1,2)]) <- TRUE

ods <- findEncodingDim(ods = ods,
                       BPPARAM = snowparam)

findEncodingDim stops with this error:

Error: BiocParallel errors
  element index: 1, 2, 3, 4, 5, 6, ...
  first error: all(is.finite(value)) is not TRUE
In addition: Warning message:
stop worker failed:
  attempt to select less than one element in OneIndex 

The error occurs when either using SerialParam, MultiCoreParam or SnowParam and I could reproduce it on two different systems. The error seems to occur only when setting sampleExclusionMask and using the default value for the params argument in findEncodingDim, e.g. when using params = c(4,8) there is no error.

Best,
Christian

Hi Christian,
Thanks, you can send me the count matrix to yepez 'at' in.tum.de.
We are going to try to reproduce your issue and let you know once we find the bug.
Best,
Vicente

The email should be on its way.

Hi Chris,

unfortunately we could not reproduce your error yet.
Could you please send your 'sessionInfo()'

Thanks

Sure, sessionInfo() for both system that can reproduce the error:

  • system I:
R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.9.0
LAPACK: /usr/lib/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_DE.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] OUTRIDER_1.4.0              data.table_1.12.6           SummarizedExperiment_1.16.0 DelayedArray_0.12.0        
 [5] matrixStats_0.55.0          GenomicFeatures_1.38.0      AnnotationDbi_1.48.0        Biobase_2.46.0             
 [9] GenomicRanges_1.38.0        GenomeInfoDb_1.22.0         IRanges_2.20.0              S4Vectors_0.24.0           
[13] BiocGenerics_0.32.0         BiocParallel_1.20.0        

loaded via a namespace (and not attached):
  [1] colorspace_1.4-1         htmlTable_1.13.2         XVector_0.26.0           base64enc_0.1-3          rstudioapi_0.10         
  [6] bit64_0.9-7              codetools_0.2-16         splines_3.6.2            PRROC_1.3.1              geneplotter_1.64.0      
 [11] knitr_1.26               zeallot_0.1.0            Formula_1.2-3            jsonlite_1.6             Rsamtools_2.2.1         
 [16] annotate_1.64.0          cluster_2.1.0            dbplyr_1.4.2             pheatmap_1.0.12          compiler_3.6.2          
 [21] httr_1.4.1               backports_1.1.5          assertthat_0.2.1         Matrix_1.2-18            lazyeval_0.2.2          
 [26] acepack_1.4.1            htmltools_0.4.0          prettyunits_1.0.2        tools_3.6.2              gtable_0.3.0            
 [31] glue_1.3.1               GenomeInfoDbData_1.2.2   dplyr_0.8.3              rappdirs_0.3.1           Rcpp_1.0.3              
 [36] vctrs_0.2.0              Biostrings_2.54.0        gdata_2.18.0             rtracklayer_1.46.0       iterators_1.0.12        
 [41] xfun_0.11                stringr_1.4.0            lifecycle_0.1.0          renv_0.7.1-20            gtools_3.8.1            
 [46] XML_3.98-1.20            dendextend_1.12.0        MASS_7.3-51.4            zlibbioc_1.32.0          scales_1.1.0            
 [51] TSP_1.1-7                pcaMethods_1.78.0        hms_0.5.2                RColorBrewer_1.1-2       BBmisc_1.11             
 [56] curl_4.2                 memoise_1.1.0            heatmaply_0.16.0         gridExtra_2.3            ggplot2_3.2.1           
 [61] biomaRt_2.42.0           rpart_4.1-15             latticeExtra_0.6-28      stringi_1.4.3            RSQLite_2.1.2           
 [66] genefilter_1.68.0        gclus_1.3.2              foreach_1.4.7            checkmate_1.9.4          seriation_1.2-8         
 [71] caTools_1.17.1.2         rlang_0.4.2              pkgconfig_2.0.3          bitops_1.0-6             lattice_0.20-38         
 [76] purrr_0.3.3              GenomicAlignments_1.22.1 htmlwidgets_1.5.1        bit_1.1-14               tidyselect_0.2.5        
 [81] plyr_1.8.4               magrittr_1.5             DESeq2_1.26.0            R6_2.4.1                 gplots_3.0.1.1          
 [86] Hmisc_4.3-0              DBI_1.0.0                pillar_1.4.2             foreign_0.8-72           survival_3.1-8          
 [91] RCurl_1.95-4.12          nnet_7.3-12              tibble_2.1.3             crayon_1.3.4             KernSmooth_2.23-16      
 [96] BiocFileCache_1.10.2     plotly_4.9.1             viridis_0.5.1            progress_1.2.2           locfit_1.5-9.1          
[101] grid_3.6.2               blob_1.2.0               digest_0.6.23            webshot_0.5.2            xtable_1.8-4            
[106] tidyr_1.0.0              openssl_1.4.1            munsell_0.5.0            registry_0.5-1           viridisLite_0.3.0       
[111] askpass_1.1
  • system II:
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils    
[8] methods   base     

other attached packages:
 [1] OUTRIDER_1.4.0              data.table_1.12.6          
 [3] SummarizedExperiment_1.16.0 DelayedArray_0.12.0        
 [5] matrixStats_0.55.0          GenomicFeatures_1.38.0     
 [7] AnnotationDbi_1.48.0        Biobase_2.46.0             
 [9] GenomicRanges_1.38.0        GenomeInfoDb_1.22.0        
[11] IRanges_2.20.0              S4Vectors_0.24.0           
[13] BiocGenerics_0.32.0         BiocParallel_1.20.0        

loaded via a namespace (and not attached):
  [1] colorspace_1.4-1         htmlTable_1.13.2         XVector_0.26.0          
  [4] base64enc_0.1-3          rstudioapi_0.10          bit64_0.9-7             
  [7] codetools_0.2-16         splines_3.6.0            PRROC_1.3.1             
 [10] geneplotter_1.64.0       knitr_1.26               zeallot_0.1.0           
 [13] Formula_1.2-3            jsonlite_1.6             Rsamtools_2.2.1         
 [16] annotate_1.64.0          cluster_2.1.0            dbplyr_1.4.2            
 [19] pheatmap_1.0.12          compiler_3.6.0           httr_1.4.1              
 [22] backports_1.1.5          assertthat_0.2.1         Matrix_1.2-17           
 [25] lazyeval_0.2.2           acepack_1.4.1            htmltools_0.4.0         
 [28] prettyunits_1.0.2        tools_3.6.0              gtable_0.3.0            
 [31] glue_1.3.1               GenomeInfoDbData_1.2.2   dplyr_0.8.3             
 [34] rappdirs_0.3.1           Rcpp_1.0.3               vctrs_0.2.0             
 [37] Biostrings_2.54.0        gdata_2.18.0             rtracklayer_1.46.0      
 [40] iterators_1.0.12         xfun_0.11                stringr_1.4.0           
 [43] lifecycle_0.1.0          renv_0.8.3               gtools_3.8.1            
 [46] XML_3.98-1.20            dendextend_1.12.0        MASS_7.3-51.4           
 [49] zlibbioc_1.32.0          scales_1.1.0             TSP_1.1-7               
 [52] pcaMethods_1.78.0        hms_0.5.2                RColorBrewer_1.1-2      
 [55] BBmisc_1.11              curl_4.2                 memoise_1.1.0           
 [58] heatmaply_0.16.0         gridExtra_2.3            ggplot2_3.2.1           
 [61] biomaRt_2.42.0           rpart_4.1-15             latticeExtra_0.6-28     
 [64] stringi_1.4.3            RSQLite_2.1.2            genefilter_1.68.0       
 [67] gclus_1.3.2              foreach_1.4.7            checkmate_1.9.4         
 [70] seriation_1.2-8          caTools_1.17.1.2         rlang_0.4.1             
 [73] pkgconfig_2.0.3          bitops_1.0-6             lattice_0.20-38         
 [76] purrr_0.3.3              GenomicAlignments_1.22.1 htmlwidgets_1.5.1       
 [79] bit_1.1-14               tidyselect_0.2.5         plyr_1.8.4              
 [82] magrittr_1.5             DESeq2_1.26.0            R6_2.4.1                
 [85] gplots_3.0.1.1           Hmisc_4.3-0              DBI_1.0.0               
 [88] pillar_1.4.2             foreign_0.8-72           survival_2.44-1.1       
 [91] RCurl_1.95-4.12          nnet_7.3-12              tibble_2.1.3            
 [94] crayon_1.3.4             KernSmooth_2.23-15       BiocFileCache_1.10.2    
 [97] plotly_4.9.1             viridis_0.5.1            progress_1.2.2          
[100] locfit_1.5-9.1           grid_3.6.0               blob_1.2.0              
[103] digest_0.6.22            webshot_0.5.1            xtable_1.8-4            
[106] tidyr_1.0.0              openssl_1.4.1            munsell_0.5.0           
[109] registry_0.5-1           viridisLite_0.3.0        askpass_1.1 

I just tried setting up a centos 7 based singularity container and within the container I'm also able to reproduce the error. But I guess it's not too helpful for debugging.

Should I try to set up something like a VirtualBox image that can reproduce the error?

loipf commented

Hello, Christian,

The bug is fixed in the latest github version from a few hours ago.

It depended a bit on the injected outliers and could be reproduced on the second run. Also, it only occurred for certain dimensions, which took some time to figure out. The error was a bug with the exclusion mask, which resulted in values that were too high and were considered infinite. This problem is solved with cc94d3a

Thanks for the detailed information and reports. Nice to see you working with OUTRIDER.

Best wishes,
Stefan

Hi Stefan,

yes, can confirm the current github version fixes the bug for me as well, thanks!

loipf commented

great perfect !